\input macros[pap,rpg]
\magnification\magstep1
%\nopagenumbers
%\smallerskips
\def\qlambda/{{\bf QLAMBDA}}
\def\qlet/{{\bf QLET}}
\def\lambda/{{\bf LAMBDA}}
\def\catch/{{\bf CATCH}}
\def\throw/{{\bf THROW}}
\def\qcatch/{{\bf QCATCH}}
\def\unwpro/{{\bf UNWIND-PROTECT}}
\def \V/{$*\hbox{{\bf V}}*$}
\def \Vhand/{$*\hbox{{\bf V-HANDLER}}*$}
\def\Y/{{\bf Y}}
\def\QY/{{\bf QY}}
\def\release-lock/{{\bf RELEASE-LOCK}}
\def\llet/{{\bf LET}}
\paper:APPENDIX B.
\vskip .5truein
\centerline{\bf Qlisp: An Overview}
\vskip .5truein
\centerline{Richard P. Gabriel}
\vskip .5truein
\centerline{Stanford University}
\vskip .5truein

\sect Introduction

As the need for high-speed computers increases, the need for
multi-processors will become more apparent. One of the major stumbling
blocks to the development of useful multi-processors has been the lack of
a good multi-processing language---one which is both powerful and
understandable to programmers.

Among the most compute-intensive programs are artificial intelligence (AI)
programs, and researchers hope that the potential degree of parallelism in
AI programs is higher than in many other applications.  In this paper we
propose multi-processing extensions to Lisp.  Unlike other proposed
multi-processing Lisps, this one provides only a few very powerful and
intuitive primitives rather than a number of parallel variants of familiar
constructs.

This language is called Qlisp.

\sect Design Goals

\item{1.}Because Lisp manipulates pointers, this Lisp dialect will run in
a {\sl shared-memory} architecture;

\item{2.}Because any real multi-processor will have only a finite number of
CPU's, and because the cost of maintaining a process along with its communications
channels will not be zero, there must be a means to limit the degree of
multi-processing at runtime;

\item{3.}Only minimal extensions to Lisp should be made to help programmers use the
new constructs;

\item{4.}Ordinary Lisp constructs should take on new meanings in the multi-processing
setting, where appropriate, rather than proliferating new constructs; and

\item{5.}The constructs should all work in a uni-processing setting (for
example, it should be possible to set the degree of multi-processing to 1
as outlined in point 2).

\sect This Paper

This paper presents the added and re-interpreted Lisp constructs and
shows examples of how to use them. A simulator for the language
has been written and used to obtain performance estimates on
sample problems. This simulator and some of the problems are briefly
presented.

\sect QLET

The obvious choice for a multi-processing primitive for Lisp is
one which evaluates arguments to a lambda-form in parallel.
\qlet/ serves this purpose. Its form is:

$$\vbox{\settabs\+\cr
\+({\bf QLET} &{\it pred} &(&($\hbox{\it x}↓1$& $\hbox{\it arg}↓1$)\cr
\+&&&&$\vdots$\cr
\+&&&($\hbox{\it x}↓n$ $\hbox{\it arg}↓n$))\cr
\+&. {\it body})\cr}$$

{\it Pred} is a predicate that is evaluated before any other action
regarding this form is taken; it is assumed to evaluate to one of: (),
{\bf EAGER}, or something else.

If {\it pred} evaluates to (), then the \qlet/ acts exactly as a {\bf
LET}. That is, the arguments $\hbox{\it arg}↓1\ldots\hbox{\it arg}↓n$
are evaluated as usual and their values bound to 
$\hbox{\it x}↓1\ldots\hbox{\it x}↓n$, respectively.

If {\it pred} evaluates to non-(), then the \qlet/ will cause some
multi-processing to happen.  Assume {\it pred} returns something other
than () or {\bf EAGER}. Then processes are spawned, one for each
$\hbox{\it arg}↓i$. The process evaluating the \qlet/ goes into a wait
state: When all of the values $\hbox{\it arg}↓1\ldots\hbox{\it arg}↓n$ are
available, their values are bound to $\hbox{\it x}↓1\ldots\hbox{\it x}↓n$,
respectively, and each form in the list of forms, {\it body}, is
evaluated.

Assume {\it pred} returns {\bf EAGER}. Then \qlet/ acts exactly as above,
except that the process evaluating the \qlet/ does not wait: It proceeds
to evaluate the forms in {\it body}. But if in evaluating the forms in
{\it body} the value of one of the arguments is required, $\hbox{\it
arg}↓i$, the process evaluating the \qlet/ waits. If that value has been
supplied already, it is simply used.

To implement {\bf EAGER} binding, the value of the {\bf EAGER} variables
could be set to an `empty' value, which could either be an empty memory
location, like that supported by the Denelcor HEP \ref:Smith.1978., or a
Lisp object with a tag field indicating an empty or pending object. At
worst, every use of a value would have to check for a full pointer.
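For illustration only, such a tagged `empty' object could be modeled in
Lisp itself as a flag/value pair; MAKE-EMPTY-CELL, FILL-CELL, and TOUCH
are hypothetical names, and the busy-wait in TOUCH merely stands in for
what empty-location hardware or a tag check would do:

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} MAKE-EMPTY-CELL () ({\bf CONS} () ()))\cr
\+({\bf DEFUN} FILL-CELL (C V)\cr
\+\quad({\bf SETF} ({\bf CDR} C) V)\cr
\+\quad({\bf SETF} ({\bf CAR} C) T))\cr
\+({\bf DEFUN} TOUCH (C)\cr
\+\quad({\bf DO} ()(({\bf CAR} C) ({\bf CDR} C))))\cr}$$

\noindent Every use of an {\bf EAGER}-bound variable would then go through
something like TOUCH.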

We will refer to this style of parallelism as \qlet/ {\it application}.

\ssect Queue-based

The Lisp is described as `queue-based' because the model of computation is
that whenever a process is spawned, it is placed on a global queue of
processes.  A scheduler then assigns that process to some processor.  Each
processor is assumed to be able to run any number of processes, much as a
timesharing system does, so that regardless of the number of processes
spawned, progress will be made. We will call a process running on a
processor a {\it job}.

The ideal situation is that the number of processes active at any one time
will be roughly equal to the number of physical processors
available.\note{Strictly speaking this isn't true. Simulations show that
the ideal situation depends on the length of time it takes to create a
process and the amount of waiting the average process needs to do. If the
creation time is short, but realistic, and if there is a lot of waiting
for values, then it is better to use some of the waiting time creating
active processes, so that no processor will be idle.  The ideal situation
has no physical processor idle.}

The idea behind {\it pred}, then, is that at runtime it is desirable to
control the number of processes spawned. Simulations show a marked dropoff
in total performance as the number of processes running on each processor increases,
{\it assuming that process creation time is non-zero.}

\ssect Example \qlet/

Here is a simple example of the use of \qlet/.
The point of this piece of code is to apply the function {\bf CRUNCH}
to the $\hbox{n}↓1↑{\hbox{th}}$  element of the list $\hbox{L}↓1$,
the $\hbox{n}↓2↑{\hbox{th}}$  element of the list $\hbox{L}↓2$, and
the $\hbox{n}↓3↑{\hbox{th}}$  element of the list $\hbox{L}↓3$.

$$\vbox{\settabs\+\cr
\+(&{\bf QLET} &T &(&(&X\cr
\+&&&&&({\bf DO} &(&(L $\hbox{L}↓1$ ({\bf CDR} L))\cr
\+&&&&&&&(I 1 ({\bf 1+} I)))\cr
\+&&&&&&(($=$ I $\hbox{N}↓1$) ({\bf CAR} L))))\cr
\+&&&&(&Y\cr
\+&&&&&({\bf DO} &(&(L $\hbox{L}↓2$ ({\bf CDR} L))\cr
\+&&&&&&&(I 1 ({\bf 1+} I)))\cr
\+&&&&&&(($=$ I $\hbox{N}↓2$) ({\bf CAR} L))))\cr
\+&&&&(&Z\cr
\+&&&&&({\bf DO} &(&(L $\hbox{L}↓3$ ({\bf CDR} L))\cr
\+&&&&&&&(I 1 ({\bf 1+} I)))\cr
\+&&&&&&(($=$ I $\hbox{N}↓3$) ({\bf CAR} L)))))\cr
\+&&({\bf CRUNCH} X Y Z))\cr}$$

\ssect Functions

You might ask: Can
a function, like {\bf CRUNCH}, be defined to be
`parallel' so that expressions like the \qlet/ above don't
appear in code?  The answer is no.

The reasons are complex, but the primary reason is lexicality.  Suppose it
were possible to define a function so that a call to that function would
cause the arguments to it to be evaluated in parallel. That is, a form
like (f~$a↓1\ldots a↓n$) would cause each argument, $a↓i$, to be evaluated
concurrently with the evaluation of the others.  In this case, to be safe,
one would only be able to invoke $f$ on arguments whose evaluations were
independent of each other. Because the definition of a function can be,
textually, far away from some of its invocations, the programmer would not
know on seeing an invocation of a function whether the arguments would be
evaluated in parallel.

Halstead's MultiLisp \ref:Halstead.1984. defines another type of parallel
function invocation in which the function-calling form itself states that
parallel evaluation is to take place. This form is called {\bf PCALL}.
Using our formulation, one could define {\bf PCALL} as a macro:

$$\vbox{\settabs\+\cr
\+({\bf PCALL} {\it f} $a↓1\ldots a↓n$)\cr}$$

\noindent would accomplish parallel argument evaluation. Of course,
this is just a macro for a \qlet/ application.
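Such a macro could be sketched as follows, assuming a Common-Lisp-style
{\bf DEFMACRO} and {\bf GENSYM} in the underlying Lisp:

$$\vbox{\settabs\+\cr
\+({\bf DEFMACRO} PCALL (F . ARGS)\cr
\+\quad({\bf LET} ((VARS ({\bf MAPCAR} '({\bf LAMBDA} (A) ({\bf GENSYM})) ARGS)))\cr
\+\quad\quad`({\bf QLET} T ,({\bf MAPCAR} '{\bf LIST} VARS ARGS)\cr
\+\quad\quad\quad(,F ,@VARS))))\cr}$$

\noindent A call ({\bf PCALL} {\bf CRUNCH} $a↓1$ $a↓2$) then expands into a
\qlet/ that binds two generated variables to $a↓1$ and $a↓2$, evaluated in
parallel, and applies {\bf CRUNCH} to them.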

\ssect A Real Example

This is an example of a simple, but real, Lisp function. It performs the
function of the traditional Lisp function, {\bf SUBST}, but in parallel:

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &{\bf QSUBST} (X Y Z)\cr
\+&({\bf COND} &(({\bf EQ} Y Z) X)\cr
\+&&(({\bf ATOM} Z) Z)\cr
\+&&(&T\cr
\+&&&(&{\bf QLET} &T (&(Q ({\bf QSUBST} X Y ({\bf CAR} Z)))\cr
\+&&&&&&(R ({\bf QSUBST} X Y ({\bf CDR} Z))))\cr
\+&&&&&({\bf CONS} Q R)))))\cr}$$

\sect \qlambda/ Closures

In some Lisps (Common Lisp, for example) it is possible to create
{\it closures}: function-like objects that capture their 
definition-time environment.
When a closure is applied, that environment is re-established. 

\qlet/ application, as we saw above, is a good means for expressing
parallelism that has the regularity of, for example, an underlying data
structure. Because a closure is already a lot like a separate process, it
could be used as a means for expressing less regular parallel
computations.

$$\vbox{\settabs\+\cr
\+({\bf QLAMBDA} &{\it pred} &({\it lambda-list}) &.\ {\it body})\cr

\noindent creates a closure.
{\it Pred} is a predicate that is evaluated before any other action
regarding this form is taken. 
It is assumed to evaluate to either (), {\bf EAGER}, or something else.
If {\it pred} evaluates to (), then the
\qlambda/ acts exactly as a {\bf LAMBDA}. That is, a closure is
created; applying this closure is exactly the same as applying a normal
closure. 

If {\it pred} evaluates to something other than () or {\bf EAGER}, the \qlambda/
creates a closure that, when applied, is run as a separate process.
Creating the closure by evaluating the \qlambda/ expression is called {\it
spawning}; the process that evaluates the \qlambda/ is called the {\it
spawning process}; and the process that is created by the \qlambda/ is
called the {\it spawned process}.  When a closure running as a separate
process is applied, the separate process is started, the arguments are
evaluated by the spawning process, and a message is sent to the spawned
process containing the evaluated arguments and a return address. The
spawned process does the appropriate lambda-binding, evaluates its body,
and finally returns the results to the
spawning process. We call a closure that will run or is running in its own
process a {\it process closure}. In short, the expression ({\bf
QLAMBDA}~non-()~$\ldots$) returns a process closure as its value.

If {\it pred} evaluates to {\bf EAGER}, then a closure is created which is
immediately spawned. It lambda-binds empty binding cells as described earlier,
and evaluation of its body starts immediately. When an argument is needed, the process
either has had it supplied or it blocks. Similarly, if the process completes
before the return address has been supplied, the process blocks.

This curious method of evaluation will, surprisingly, be used to write a
parallel \Y/ function!

\ssect Value-Requiring Situations

Suppose there are no further rules for the timing of evaluations than
those given, along with their obvious implications; have we defined a useful
set of primitives?

No. Consider the situation:

$$\vbox{\settabs\+\cr
\+({\bf PROGN} ({\bf F} X) ({\bf G} Y))\cr}$$

\noindent If {\bf F} happens to be bound to a process closure, then the
process evaluating the {\bf PROGN} will spawn off the process to evaluate
({\bf F} X), wait for the result, and then move on to evaluate ({\bf G}
Y), throwing away the value {\bf F} returned. If this is the case, it is
plain that there is not much of a reason to have process closures.

Therefore we make the following behavioral requirement: If a process closure is
called in a value-requiring context, the calling process waits; and if a process
closure is called in a value-ignoring situation, the caller does not wait for
the result, and the callee is given a void return address.

For example, given the following code:

$$\vbox{\settabs\+\cr
\+({\bf LET} &(({\bf F} ({\bf QLAMBDA} T (Y)({\bf PRINT} ($*$ Y Y)))))\cr
\+&({\bf F} 7)\cr
\+&({\bf PRINT} ($*$ 6 6)))\cr}$$

\noindent there is no {\it a priori} way to know whether you will see 49
printed before or after 36.\note{We can assume that there is a single
print routine that guarantees that when something is printed, no other
print request interferes with it. Thus, we will not see 43 and then 96
printed in this example.}

To increase the readability of code we introduce two forms, which could be
defined as macros, to guarantee a form will appear in a value-requiring
or in a value-ignoring position.

$$\vbox{\settabs\+\cr
\+({\bf WAIT} {\it form})\cr}$$

\noindent will evaluate {\it form} and wait for the result;

$$\vbox{\settabs\+\cr
\+({\bf NO-WAIT} {\it form})\cr}$$

\noindent will evaluate {\it form} and not wait for the result.

For example,

$$\vbox{\settabs\+\cr
\+(&{\bf PROGN}\cr
\+&({\bf WAIT} $\hbox{\it form}↓1$)\cr
\+&$\hbox{\it form}↓2$)\cr}$$

\noindent will wait for $\hbox{\it form}↓1$ to complete.
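Given the behavioral requirement above, these two macros could be
sketched by placing {\it form} in an argument position, which is
value-requiring, or in a non-final {\bf PROGN} position, which is
value-ignoring:

$$\vbox{\settabs\+\cr
\+({\bf DEFMACRO} WAIT (FORM)\cr
\+\quad`(({\bf LAMBDA} (X) X) ,FORM))\cr
\+({\bf DEFMACRO} NO-WAIT (FORM)\cr
\+\quad`({\bf PROGN} ,FORM ()))\cr}$$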

\ssect Applying a Process Closure

Process closures can be passed as arguments and returned as values.
Therefore, a process closure can be in the middle of evaluating
its body given a set of arguments when it is applied by another
process. Similarly, a process can apply a process closure in a
value-ignoring position and then immediately apply the same
process closure with a different set of arguments.

Each process closure has a queue for arguments and return addresses.
When a process closure is applied, the new set of arguments and the
return address are placed on this queue. The body of the process closure
is evaluated to completion before the set of arguments at the head
of the queue is processed.

We will call this property {\it integrity}, because a process closure is
not copied or disrupted from evaluating its body with a set of arguments:
Multiple applications of the same process closure will not create multiple
copies of it.
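For example, in the following sketch (ADD1 is a hypothetical name) the two
applications are queued in the order they are made, the two {\bf PRINT}s
cannot interfere with each other, and 2 is printed before 3:

$$\vbox{\settabs\+\cr
\+({\bf LET} ((ADD1 ({\bf QLAMBDA} T (X) ({\bf PRINT} ({\bf 1+} X)))))\cr
\+\quad({\bf NO-WAIT} (ADD1 1))\cr
\+\quad({\bf NO-WAIT} (ADD1 2)))\cr}$$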

\sect CATCH and QCATCH

So far we have discussed methods for spawning processes and communicating
results. Are there any ways to kill processes? Yes, there is one basic
method, and it is based on an intuitively similar, already-existing
mechanism in many Lisps.

\catch/ and \throw/ are a way to do non-local, dynamic exits within Lisp.
The idea is that if a computation is surrounded by a \catch/, then a \throw/
will force return from that \catch/ with a specified value, terminating
any intermediate computations.

$$\vbox{\settabs\+\cr
\+({\bf CATCH} {\it tag} {\it form})\cr}$$

\noindent will evaluate {\it form}. If {\it form} returns with a value,
the value of the \catch/ expression is the value of the {\it form}. If the
evaluation of {\it form} causes the form

$$\vbox{\settabs\+\cr
\+({\bf THROW} {\it tag} {\it value})\cr}$$

\noindent to be evaluated, then \catch/ is exited immediately with the
value {\it value}. \throw/ causes all special bindings done between the
\catch/ and the \throw/ to revert.  If there are several \catch/'s, the
\throw/ returns from the dynamically closest \catch/ with a tag {\bf EQ} to the
\throw/ tag.

\ssect \catch/

In a multi-processing setting, when a \catch/ returns a value, all
processes that were spawned as part of the evaluation of the \catch/ are
killed at that time. 

Consider:

$$\vbox{\settabs\+\cr
\+({\bf CATCH} &'QUIT\cr
\+&(&{\bf QLET} &T (&(&X\cr
\+&&&&&({\bf DO} &((L $\hbox{L}↓1$ ({\bf CDR} L)))\cr
\+&&&&&&(({\bf NULL} L) 'NEITHER)\cr
\+&&&&&&({\bf COND} &(({\bf P} ({\bf CAR} L))\cr
\+&&&&&&&({\bf THROW} 'QUIT $\hbox{L}↓1$)))))\cr
\+&&&&(&Y\cr
\+&&&&&({\bf DO} &((L $\hbox{L}↓2$ ({\bf CDR} L)))\cr
\+&&&&&&(({\bf NULL} L) 'NEITHER)\cr
\+&&&&&&({\bf COND} (({\bf P} ({\bf CAR} L))\cr
\+&&&&&&({\bf THROW} 'QUIT $\hbox{L}↓2$))))))\cr
\+&&&X))\cr}$$

\noindent This piece of code will scan down $\hbox{L}↓1$ and $\hbox{L}↓2$
looking for an element that satisfies {\bf P}. When such an element is
found, the list that contains that element is returned, and the other
process is killed, because the \throw/ causes the \catch/ to exit with a
value.  If both lists terminate without such an element being found, the
atom NEITHER is returned.

Note that if 
$\hbox{L}↓1$ and
$\hbox{L}↓2$ are both circular lists, but one of them is guaranteed to
contain an element satisfying {\bf P}, the entire process terminates.

If a process closure was spawned beneath a \catch/ and if that \catch/
returns while that process closure is running, that process closure will be
killed when the \catch/ returns.

\ssect \qcatch/

$$\vbox{\settabs\+\cr
\+({\bf QCATCH} {\it tag} {\it form})\cr}$$

\qcatch/ is similar to \catch/, but if the {\it
form} returns with a value (no \throw/ occurs) and there are other
processes still active, \qcatch/ will wait until they all finish. The value of
the \qcatch/ is the value of {\it form}. For there to be any processes
active when {\it form} returns, each one had to have been applied in a
value-ignoring setting, and therefore all of the values of the outstanding
processes will be duly ignored.

If a \throw/ causes the \qcatch/ to exit with a value, the \qcatch/
kills all processes spawned beneath it.

We will define another macro to simplify code. Suppose we want to
spawn the evaluation of some form as a separate process. Here is
one way to do that:

$$\vbox{\settabs\+\cr
\+(&({\bf LAMBDA} &(F)\cr
\+&&(F) T)\cr
\+&({\bf QLAMBDA} T () {\it form}))\cr}$$

A second way is:

$$\vbox{\settabs\+\cr
\+({\bf FUNCALL} ({\bf QLAMBDA} T () {\it form}))\cr}$$

\noindent We will choose the latter as the definition of:

$$\vbox{\settabs\+\cr
\+({\bf SPAWN} {\it form})\cr}$$

\noindent Notice that {\bf SPAWN} combines spawning and application.
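As a macro, this is simply (a sketch):

$$\vbox{\settabs\+\cr
\+({\bf DEFMACRO} SPAWN (FORM)\cr
\+\quad`({\bf FUNCALL} ({\bf QLAMBDA} T () ,FORM)))\cr}$$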

Here is a pair of functions that work together to define a parallel
{\bf EQUAL} function on binary trees:

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &{\bf EQUAL} (X Y)\cr
\+&({\bf QCATCH} &'EQUAL\cr
\+&&({\bf EQUAL-1} X Y)))\cr}$$

\noindent {\bf EQUAL} uses an auxiliary function, {\bf EQUAL-1}:

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &{\bf EQUAL-1} (X Y)\cr
\+&({\bf COND} &(({\bf EQ} X Y))\cr
\+&&(&({\bf OR} &({\bf ATOM} X)\cr
\+&&&&({\bf ATOM} Y))\cr
\+&&&({\bf THROW} 'EQUAL ()))\cr
\+&&(&T\cr
\+&&&\cleartabs&({\bf SPAWN} ({\bf EQUAL-1} ({\bf CAR} X)({\bf CAR} Y)))\cr
\+&&&&({\bf SPAWN} ({\bf EQUAL-1} ({\bf CDR} X)({\bf CDR} Y)))\cr
\+&&&&T)))\cr}$$

The idea is to spawn off processes that examine parts of the trees
independently.  If the trees are not equal, a \throw/ will return a () and
kill the computation.  If the trees are equal, no \throw/ will ever
occur.  In this case, the main process will return T to the \qcatch/ in
{\bf EQUAL}. This \qcatch/ will then wait until all of the other processes
die off; finally it will return this T.

\ssect \throw/

\throw/ will throw a value to the \catch/ above it, and processes will be
killed where applicable. The question is, when a \throw/ is seen, exactly
which \catch/ is thrown to and exactly which processes will be killed?

The processes that will be killed are precisely those processes spawned
beneath the \catch/ that receives the \throw/ and those spawned by
processes spawned beneath those, and so on.

The question boils down to which \catch/ is thrown to. To determine that
\catch/, find the process in which the \throw/ is evaluated and look up the
process-creation chain to find the first matching tag.

If you see a code fragment like:

$$\vbox{\settabs\+\cr
\+({\bf QLAMBDA} T () ({\bf THROW} {\it tag} {\it value}))\cr}$$

\noindent the \throw/ is evaluated within the \qlambda/ process closure,
so look at the process in which the \qlambda/ is created to start
searching for the proper \catch/. Thus, if you apply a process closure
with a \throw/ in it, the \throw/ will be to the first \catch/ with a
matching tag {\it in the process chain that the \qlambda/ was created in},
not in the current process chain.

Thus we say that \throw/ throws dynamically by creation.

\sect \unwpro/

When \throw/ is used to terminate a computation, there may be other
actions that need to be performed before the context is destroyed.
For instance, suppose that some files have been opened and their streams
lambda-bound. If the bindings are lost, the files will remain open until
the next garbage collection. There must be a way to gracefully close these
files when a \throw/ occurs. The construct to do that is \unwpro/.


$$\vbox{\settabs\+\cr
\+({\bf UNWIND-PROTECT} {\it form} {\it cleanup})\cr}$$

\noindent will evaluate {\it form}. When {\it form} returns, {\it cleanup}
is evaluated. If {\it form} causes a \throw/ to be evaluated, {\it cleanup}
will be performed anyway. Here is a typical use:

$$\vbox{\settabs\+\cr
\+({\bf LET} &((F ({\bf OPEN} ``FOO.BAR'')))\cr
\+&({\bf UNWIND-PROTECT} ({\bf READ-SOME-STUFF}) ({\bf CLOSE} F)))\cr}$$

In a multi-processing setting, when a cleanup form needs to be evaluated
because a \throw/ occurred, the process that contains the \unwpro/ is retained
to evaluate all of the cleanup forms for that process before it is killed. The process
is placed in an un-killable state, and if a further \throw/ occurs, it has no
effect until the current cleanup forms have been completed.

Thus, if control ever enters an \unwpro/, it is guaranteed that the cleanup
form will be evaluated. Dynamically nested \unwpro/'s will have their cleanup
forms evaluated from the inside-out, even if a \throw/ has occurred.

To be more explicit, recall that the \catch/ that receives the value
thrown by a \throw/ performs the kill operations. The \unwpro/ cleanup
forms are evaluated in un-killable states by the appropriate \catch/ {\it
before} any kill operations are performed. This means that the process
structure below that \catch/ is left intact until the \unwpro/ cleanup
forms have completed.

\ssect Other Primitives

One pair of primitives is useful for controlling the operation of the
processes as they are running; they are {\bf SUSPEND-PROCESS} and {\bf
RESUME-PROCESS}.  The former takes a process closure and puts it in a wait
state. This state cannot be interrupted, except by a {\bf RESUME-PROCESS},
which will resume this process.  
This is useful if some controlling process wishes to pause some processes
in order to favor another process that is more likely to succeed.

A use for {\bf SUSPEND-PROCESS} is to implement a general locking
mechanism, which will be described later.

\ssect An Unacceptable Alternative

There is another approach that could have been taken to the semantics of:

$$\vbox{\settabs\+\cr
\+({\bf QLAMBDA} &{\it pred} &({\it lambda-list}) &.\ {\it body})\cr}$$

\noindent Namely, we could have stated that the arguments to a process
closure could trickle in, some from one source and some from another.
Because a process closure could then need to wait for arguments from
several sources, we could use this behavior as a means to achieve the effects
of {\bf SUSPEND-PROCESS}. That is, we could apply a process closure which
requires one argument to no arguments; the process closure would then
need to wait for an argument to be supplied. Because we would not
supply that argument until we wanted the process to continue, supplying the
argument would achieve {\bf RESUME-PROCESS}.

This would be quite elegant, but for the fact that process closures
would then be able to get arguments from anywhere chaotically. We would
have to abandon the ability to know the order of variable-value pairing
in the lambda-binding that occurs in process closures. For instance,
if we had a process closure that took two arguments, one a number and the other
a list, and if one argument were to be supplied by one process and the second
by another, there would be no way to cause one argument to arrive at the
process closure before the other, and hence one could not be sure that
the number was paired with the variable that was intended to have a numeric value.

One could use keyword arguments \ref:Steele.1984. in this case, but that
would not solve all the problems with this scheme. How could \&{\bf REST}
arguments be handled?  There would be no way to know when all of the
arguments to the process closure had been supplied.  Suppose that a
process wanted to send 5 values to a process closure that needed exactly 5
arguments; if some other process had sent 2 to that process closure
already, how could one require that the first 3 of the 5 sent would not be
bundled with the 2 already sent to supply the process closure with random
arguments?

In short, this alternative is unacceptable.

\sect The Rest of the Paper

This completes the definition of the extensions to Lisp.  Although these
primitives form a complete set---any concurrent algorithm can be
programmed with only these primitives along with the underlying Lisp---a
real implementation of these extensions would supply further convenient
functions, such as an efficient locking mechanism.

The remainder of this paper will describe some of the tricky things that
can be done in this language, and it will present some performance studies
done with a simple simulator.

%[In particular we discuss: 
%
%\item{1.}How to kill a process with the primitives shown so far.
%\item{2.}How to implement locks.
%\item{3.}How performance can be tuned using the \qlambda/ predicates
%for multi-processors with realistic process creation times.
%\item{4.}How to use eager process closures.
%\item{5.}How to define a concurrent \Y/ function.
%\item{6.}The performance of simple recursive functions under varying
%multiprocessor configurations.
%\item{7.}The serializing effects of data structures on concurrent algorithms.
%\item{8.}The performance of exhaustive-search algorithms (travelling salesman).
%\item{9.}The performance of an AI-like benchmark (one of the benchmarks in
%the Gabriel Lisp Performance Study).
%\item{10.}Comparisons to other work.
%\item{11.}Improvements and extensions to networks of multi-processors.]
%
\sect Resource Management

We've mentioned that we assume a shared-memory Lisp, which implies that
many processes can be accessing and updating a single data structure at the
same time. In this section we show how to protect these data structures with
critical sections to allow consistent updates and accesses.

The key is closures. We spawn a process closure which is to be used as the sole
manager of a given resource, and we conduct all transactions through that
closure. We illustrate the method with an example.  

Suppose we have an application where, for very many $n$, we will need to
know whether $\exists~i \hbox{{}\ s.t.\ {}} n=\hbox{Fib}(i)$, where {\bf Fib} is
the Fibonacci function. We will call this predicate {\it Fib-p}.  Suppose
further that we want to keep a global table of all of the Fibonacci
argument/value pairs known, so that {\it Fib-p} will be a table lookup
whenever possible. We can use a variable, \V/, whose value is a pair---a
cons cell---with the {\bf CAR} being $i$ and the {\bf CDR} being $n$,
where $n = \hbox{\bf Fib}(i)$ and $i$ is the largest argument in the table.
We imagine filling up this table
as needed, using it as a cache, but the variable \V/ is used in a quick
test to decide whether to use the table rather than the Fibonacci function
to decide {\it Fib-p}.

We will ignore the details of the table manipulation and discuss only
the variable \V/. When a process wants to find out the highest Fibonacci
number in the table, it will simply do ({\bf CDR} \V/). If a process wants to
find out the pair ($i$ . Fib($i$)), it had better do this indivisibly,
because some other process might be updating \V/ concurrently.

We assume that we do not want to {\bf CONS} another pair to update
\V/---we will destructively update the pair.  Thus, we do not want to say:

$$\vbox{\settabs\+\cr
\+$\ldots$\cr
\+({\bf SETQ} \V/ ({\bf CONS} {\it arg} {\it val}))\cr
\+$\ldots$\cr}$$

Here is some code to set up the \V/ handler:

$$\vbox{\settabs\+\cr
\+({\bf SETQ} $*\hbox{\bf V-HANDLER}*$ ({\bf QLAMBDA} T (CODE) (CODE *V*)))\cr}$$

The idea is to pass this process closure a second closure which will perform
the desired operations on its lone argument; the \V/ handler passes \V/ to the
supplied closure.

Here is a code fragment to set up two variables, I and J, which will receive the
values of the components of \V/, along with the code to get those values:

$$\vbox{\settabs\+\cr
\+({\bf LET} &((I ())(J ()))\cr
\+&($\hbox{\bf \Vhand/}$ (&{\bf LAMBDA} (V)\cr
\+&&({\bf SETQ} I ({\bf CAR} V))\cr
\+&&({\bf SETQ} J ({\bf CDR} V))))\cr
\+&$\ldots$)\cr}$$

Because the process closure will evaluate its body without creating any
other copies of itself, and because all updates to \V/ will go through
\Vhand/, I and J will be such that $\hbox{J}=\hbox{Fib}(\hbox{I})$.

The code to update the value of \V/ would be:

$$\vbox{\settabs\+\cr
\+$\ldots$\cr
\+({\bf \Vhand/} (&{\bf LAMBDA} (V)\cr
\+&({\bf SETF} ({\bf CAR} V) {\it arg})\cr
\+&({\bf SETF} ({\bf CDR} V) {\it val})))\cr
\+$\ldots$\cr}$$


If the process updating \V/ does not need to wait for the
update, this call can be put in a value-ignoring position.

\ssect Fine Points

If the process closure that controls a resource is created outside of any
\catch/ or \qcatch/ that might be used to terminate subordinate process
closures, then once the process closure has been invoked, the invocation
will run to completion. If this process closure is busy when it is invoked by some
process, then even if the invoking process is killed, the invocation will
proceed. Thus requests on a resource controlled by this process closure
are always completed.
Another way to guarantee that a request happens is to put it inside of an
\unwpro/.

\sect Locks

When we discussed {\bf SUSPEND-PROCESS} and {\bf RESUME-PROCESS} we mentioned
that a general locking mechanism could be implemented
using {\bf SUS\-PEND-PRO\-CESS}. Here is the code
for this example:

$$\vbox{\settabs\+\cr
\+({\bf DEFMACRO} &GET-LOCK ()\cr
\+&'({\bf CATCH} &'FOO\cr
\+&&(&{\bf PROGN}\cr
\+&&&(&LOCK\cr
\+&&&&({\bf QLAMBDA} T (RES)({\bf THROW} 'FOO RES)))\cr
\+&&&({\bf SUSPEND-PROCESS}))))\cr}$$

\vskip .25in
\noindent When {\bf SUSPEND-PROCESS} is called with no arguments, it puts
the currently running job (itself) into a wait state.
\vskip .25in

\settabs\+999&\cr
\+\ 1\ &({\bf LET} &((&LOCK\cr
\+\ 2\ &&&(&{\bf QLAMBDA} T (RETURNER)\cr
\+\ 3\ &&&&({\bf CATCH} &'LOCKTAG\cr
\+\ 4\ &&&&&(&{\bf LET} ((RES ({\bf QLAMBDA} T () ({\bf THROW} 'LOCKTAG T))))\cr
\+\ 5\ &&&&&&(RETURNER RES)\cr
\+\ 6\ &&&&&&({\bf SUSPEND-PROCESS}))))))\cr
\+\ 7\ &&\cleartabs&(&{\bf QLET} &T (&(&X \cr
\+\ 8\ &&&&&&&({\bf LET} &((OWNED-LOCK (GET-LOCK)))\cr
\+\ 9\ &&&&&&&&({\bf DO} &((I 10 ($1-$ I)))\cr
\+10\ &&&&&&&&&(&($=$ I 0)\cr
\+11\ &&&&&&&&&&(OWNED-LOCK) 7))))\cr
\+12\ &&&&&&(&Y\cr
\+13\ &&&&&&&({\bf LET} &((OWNED-LOCK (GET-LOCK)))\cr
\+14\ &&&&&&&&({\bf DO} &((I 10 ($1-$ I)))\cr
\+15\ &&&&&&&&&(&($=$ I 0)\cr
\+16\ &&&&&&&&&&(OWNED-LOCK) 8)))))\cr
\+17\ &&&&({\bf LIST} X Y)))\cr

\vskip .25in

The idea is to evaluate a GET-LOCK form, which in this case is a macro,
that will return when the lock is available; at that
point, the process that called the GET-LOCK form will have control
of the lock and, hence, the resource in question.
GET-LOCK returns a function that is invoked to release the lock.

Lines 7--17 are the test of the locking mechanism: The \qlet/
on line~7 spawns two processes; the first is the {\bf LET}
on lines 8--11; the second is the {\bf LET} on lines 13--16.
Each process will attempt to grab
the lock, and when a process has that lock, it will count down from
10, release the lock, and return a number---either 7 or 8. The two
numbers are put into a list that is the return value for the test program.

As we mentioned earlier, when a process closure is evaluating its body
given a set of arguments, it cannot be disrupted---no other call
to that process closure can occur until the previous calls are complete.
To implement a lock, then, we must produce a process closure that will
return an unlocking function, but which will not actually return!

\goodbreak
GET-LOCK sets up a \catch/ and calls the LOCK function with
a process closure that will return from this \catch/. The value
that the process closure throws will be the function we use to
return the lock. We call LOCK in a value-ignoring position so that
when the lock is finally released, LOCK will not try to return
a value to the process evaluating the GET-LOCK form. The {\bf SUSPEND-PROCESS}
application will cause the process evaluating the GET-LOCK form to wait
for the \throw/ that will happen when LOCK sends back the unlocking
function.

LOCK takes a function, the RETURNER function, that will return the
unlocking function.  LOCK binds RES to a process closure that throws to
the \catch/ on line~3.  This process closure is the function that we will
apply to return the lock.
The RETURNER function is applied to RES, which throws RES to the catch
frame with tag FOO. Because (RETURNER~RES) appears in a value-ignoring
position, this process closure is applied with no intent to return a
value. Evaluation in LOCK proceeds with the call to {\bf SUSPEND-PROCESS}.

The effect is that the process closure that will throw to LOCKTAG---and
which will eventually cause LOCK to complete---is thrown back to the
caller of GET-LOCK, but LOCK does not complete.  No other call to LOCK
will begin to execute until the \throw/ to LOCKTAG occurs---that is, when
the function, \hbox{OWNED-LOCK}, is applied.

Hence, exactly one process at a time will execute with this lock.

\goodbreak The key to understanding this code is to see that when a
\throw/ occurs, it searches up the process-creation chain that reflects
dynamically scoped \catch/'s.  Because we spawned the process closure in
GET-LOCK beneath the \catch/ there, the \throw/ in the process closure
bound to RETURNER will throw to that \catch/, ignoring the one in LOCK.
Similarly, the \throw/ that RES performs was created underneath the
\catch/ in LOCK, and so the process closure that throws to LOCKTAG returns
from the \catch/ in LOCK.
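In practice the unlocking function would be applied from an \unwpro/
cleanup form, so that a \throw/ out of the critical section cannot leave
the lock held; in this sketch, COMPUTE-WITH-RESOURCE is a hypothetical
placeholder for the guarded computation:

$$\vbox{\settabs\+\cr
\+({\bf LET} ((OWNED-LOCK (GET-LOCK)))\cr
\+\quad({\bf UNWIND-PROTECT}\cr
\+\quad\quad({\bf COMPUTE-WITH-RESOURCE})\cr
\+\quad\quad(OWNED-LOCK)))\cr}$$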

\ssect Reality.

As mentioned earlier, a real implementation of this language would
supply an efficient locking mechanism. We have tried to keep the
number of primitives down to see what would constitute a minimum
language. 

\sect Killing Processes

We've seen that a process can commit suicide, but is there any way to
kill another process? Yes; the idea is to force a process to commit suicide.
Naturally, everything must be set up correctly.

\goodbreak
We'll show a simple example of this `bomb' technique. 

Here is the entire code for this example:

\vskip .25in

\settabs\+999&\cr
\+\ 1\ &(&{\bf DEFUN} TEST ()\cr
\+\ 2\ &&(&{\bf LET} ((BOMBS ()))\cr
\+\ 3\ &&&(&{\bf LET} ((&BOMB-HANDLER\cr
\+\ 4\ &&&&&({\bf QLAMBDA} &T (TYPE ID MESSAGE)\cr
\+\ 5\ &&&&&&({\bf COND} &(&({\bf EQ} TYPE 'BOMB)\cr
\+\ 6\ &&&&&&&&({\bf PRINT} `(BOMB FOR ,ID))\cr
\+\ 7\ &&&&&&&&({\bf PUSH} `(,ID . ,MESSAGE) BOMBS))\cr
\+\ 8\ &&&&&&&(({\bf EQ} TYPE 'KILL)\cr
\+\ 9\ &&&&&&&&({\bf PRINT} `(KILL FOR ,ID))\cr
\+10\ &&&&&&&&\cleartabs(&{\bf FUNCALL}\cr
\+11\ &&&&&&&&&({\bf CDR} ({\bf ASSQ} ID BOMBS)))\cr
\+12\ &&&&&&&&T)))))\cr
\+13\ &&&&\cleartabs&(&{\bf QLET} &'EAGER (&(&X\cr
\+14\ &&&&&&&&&({\bf CATCH} 'QUIT ({\bf TESTER} BOMB-HANDLER 'A)))\cr
\+15\ &&&&&&&&(&Y\cr
\+16\ &&&&&&&&&({\bf CATCH} 'QUIT ({\bf TESTER} BOMB-HANDLER 'B))))\cr
\+17\ &&&&&&&\cleartabs(&{\bf SPAWN}\cr
\+18\ &&&&&&&&({\bf PROGN} &({\bf DO} &((I 10. ($1-$ I)))\cr
\+19\ &&&&&&&&&&(&($=$ I 0)\cr
\+20\ &&&&&&&&&&&({\bf PRINT} `(KILLING A))\cr
\+21\ &&&&&&&&&&&({\bf BOMB-HANDLER} 'KILL 'A ()))\cr
\+22\ &&&&&&&&&&({\bf PRINT} `(COUNTDOWN A ,I)))\cr
\+23\ &&&&&&&&&({\bf DO} &((I 10. ($1-$ I)))\cr
\+24\ &&&&&&&&&&(($=$ I 0)\cr
\+25\ &&&&&&&&&&&({\bf PRINT} `(KILLING B))\cr
\+26\ &&&&&&&&&&&({\bf BOMB-HANDLER} 'KILL 'B ()))\cr
\+27\ &&&&&&&&&&({\bf PRINT} `(COUNTDOWN B ,I)))))\cr
\+28\ &&&&&&&({\bf LIST} X Y)))))\cr

\vskip .25in

\goodbreak
\settabs\+999\ &\cr
\+29\ &({\bf DEFUN} &TESTER (BOMB-HANDLER LETTER)\cr
\+30\ &&({\bf BOMB-HANDLER} &'BOMB LETTER\cr
\+31\ &&&({\bf QLAMBDA} T () ({\bf THROW} 'QUIT LETTER)))\cr
\+32\ &&({\bf DO} ()(()) ({\bf PRINT} LETTER)))\cr

\vskip .25in

% Bomb Test  Example

First we set up a process closure which will collect bombs and explode
them. Line~2 defines the variable that will hold the bombs. A bomb is
an ID and a piece of code. Lines~3--12 define the bomb handler. It is a
piece of code that takes a message type, an ID, and a message It looks at the
type; if the type is BOMB, then the message is a piece of code. The ID/code
pair is placed on the list, BOMBS. If the type is KILL, then the ID is used to
find the proper bomb and explode it.

Lines~13--28 demonstrate the use of the bomb-handler. Lines~14 and~16 are
\catch/'s that the bombs will throw back to. Two processes are created,
each running TESTER. TESTER sends a bomb to {\bf BOMB-HANDLER}, which is a
process closure that will throw back to the appropriate \catch/. Because
the process closure is created under one of two \catch/'s, the \throw/
will kill the intermediate processes. The main body of TESTER is an
infinite loop that prints the second argument, which will either be the
letter {\bf A} or the letter {\bf B}.

The \qlet/ on line~13 is eager. Unless something kills the two processes
spawned as argument calculation processes, neither X nor Y will ever receive
values. But because the \qlet/ is eager, the {\bf SPAWN} on line~17
will be evaluated. This {\bf SPAWN} creates a process closure that will
kill the two argument processes.

The result of TEST is ({\bf LIST}~X~Y), which will block while waiting for values
until the argument processes are killed.

The killing process (lines~17--27) counts down from 10, kills the first argument
process, counts down from 10 again, and finally kills the second argument process.

To kill the argument process, the {\bf BOMB-HANDLER} is called with the
message type KILL and the name of the process as the ID. The {\bf BOMB-HANDLER}
kills a process by searching the list, BOMBS, for the
right bomb (which is a piece of code) and then {\bf FUNCALL}ing that
bomb.

Because a process closure is created for each call to TESTER (line~31),
and because one is spawned dynamically beneath the \catch/ on line~14 and
the other beneath the \catch/ on line~16, the {\bf BOMB-HANDLER} will not
be killed by the \throw/. When the process that is printing {\bf A} is
killed, the corresponding \throw/ throws {\bf A}. Similarly for the
process printing {\bf B}.

The value of TEST is (A~B). 
Of course there is a problem with the code, which is that the {\bf BOMB-HANDLER}
is not killed when TEST exits.

\sect Eager Process Closures

We saw that EAGER is a useful value for the predicate in 
\qlet/ applications, that is, in constructions of this form:

$$\vbox{\settabs\+\cr
\+({\bf QLET} &{\it pred} &(&($\hbox{\it x}↓1$& $\hbox{\it arg}↓1$)\cr
\+&&&&$\vdots$\cr
\+&&&($\hbox{\it x}↓n$ $\hbox{\it arg}↓n$))\cr
\+&. {\it body})\cr}$$

\noindent But it may not be obvious what use it has in the context of a
process closure.

When a process closure of the form:

$$\vbox{\settabs\+\cr
\+({\bf QLAMBDA} &'EAGER &({\it lambda-list}) &.\ {\it body})\cr}$$

\noindent is spawned, it is immediately run. And if it needs arguments or
a return address to be supplied, it waits.

Suppose we have a program with two distinct parts: The first part takes
some time to complete and the second part takes some large fraction of that time
to initialize, at which point it requires the result of the first part. The easiest
way to accomplish this is to start an eager process closure, which will immediately
start running its initialization. When the first part is ready to hand its result
to the process closure, it simply applies the process closure.

Here is an example of this overlapping of a lengthy initialization with a
lengthy computation of an argument:

$$\vbox{\settabs\+\cr
\+({\bf LET} &((F ({\bf QLAMBDA} &'EAGER (X)\cr
\+&&[Lengthy Initialization]\cr
\+&&({\bf OPERATE-ON} X))))\cr
\+&(F [Lengthy computation of X]))\cr}$$

There are other ways to accomplish this effect in this language, but this is the most
flexible technique.

\ssect A Curious Example.

A curious example of this arises when the \Y/ function is being defined in this language.

The \Y/ function is the applicative version of the \Y/ combinator, which
can be used to define {\bf LABELS} (see Scheme \ref:Steele.1978.,
\ref:Sussman.1975.).  We will briefly review the problem that \Y/ solves.

Suppose you write the code:

$$\vbox{\settabs\+\cr
\+({\bf LET} ((CONS ({\bf LAMBDA} (X Y) ({\it CONS} Y X)))) $\ldots$),\cr}$$

\noindent will {\it CONS} refer to the CONS being defined or to the
built-in {\bf CONS}? The answer is that it will refer to the built-in {\bf
CONS}, and this is not a non-terminating definition. It defines a
constructor that builds lists in the {\bf CAR} rather than the traditional
{\bf CDR} direction.

The idea is that the \lambda/ creates a closure that captures the environment
at the time of the closure creation, but this environment does not contain the
binding for CONS because the process has not gotten that far yet---it is
evaluating the form that will be the value to place in the binding for CONS
that it is about to make.

Suppose, though, that you want to define factorial in this style. You cannot
write:

$$\vbox{\settabs\+\cr
\+(&{\bf LET} &((&FACT\cr
\+&&&(&{\bf LAMBDA} (N)\cr
\+&&&&({\bf COND} &(({\bf ZEROP} N) 1)\cr
\+&&&&&(T ($*$ N ({\bf FACT} ($1-$ N))))))))\cr
\+&$\ldots$)\cr}$$

\noindent because the recursive call to {\bf FACT} refers to 
some global definition,
which presumably does not exist. Traditionally there is a \lambda/-like
form, called {\bf LABELS}, which gets around this by creating a special environment
and then re-stuffing the bindings appropriately, but there is a way to avoid
introducing {\bf LABELS}, at the expense of speed.

There is a function, called the \Y/ function, that  allows one to define
a recursive function given an abstraction of the `normal' definition of the
recursive function.  Here is an example of the abstract version of FACT that
we would need:

$$\vbox{\settabs\+\cr
\+{\bf F} $=$ (&{\bf LAMBDA} (G)\cr
\+&(&{\bf LAMBDA} (N)\cr
\+&&({\bf COND} &(({\bf ZEROP} N) 1)\cr
\+&&&(T ($*$ N (G (1- N)))))))\cr}$$

The key property of \Y/ is: 
$\forall f, \hbox{\bf Y}(f)=f(\hbox{\bf Y}(f))$.

If we were to pass {\bf F} the mathematical function {\it fact}, then 
$\hbox{\bf F}(\hbox{\it fact}) = \hbox{\it fact}$ in the mathematical
sense.  That is, {\it fact} is a fixed point for {\bf F}.
If we define FACT to be $\hbox{\bf Y}(\hbox{F})$, FACT is also a
fixed point for {\bf F}, using the property given above. Actually,
\Y/ produces the least fixed point of {\bf F}, but you can read about
that in a textbook.

\goodbreak
The definition of \Y/ in Lisp is:

$$\vbox{\settabs\+\cr
\+(&{\bf DEFUN} Y (F)\cr
\+&({\bf LET} &((H (&{\bf LAMBDA} (G)\cr
\+&&&(F (&{\bf LAMBDA} (X) \cr
\+&&&&({\bf FUNCALL} (G G) X))))))\cr
\+&&({\bf LAMBDA} (X) ({\bf FUNCALL} (H H) X))))\cr}$$

We can trace through the operation of \Y/ briefly. 
$\hbox{\bf Y}(\hbox{F})$
returns a function that looks like:

$$\vbox{\settabs\+\cr
\+({\bf LAMBDA} (X) ({\bf FUNCALL} (H H) X))\cr}$$

\noindent with H that looks like:

$$\vbox{\settabs\+\cr
\+(&{\bf LAMBDA} (G)\cr
\+&(F (&{\bf LAMBDA} (X) \cr
\+&&({\bf FUNCALL} (G G) X))))\cr}$$

\noindent and F is bound to {\bf F} above. What does 
(H~H) return? Well, it is F applied to some function, so it
returns the inner \lambda/ closure---({\bf LAMBDA} (N) $\ldots$)---in
{\bf F} above, which will be applied to X: a good sign.

H takes a function---itself in this case---and binds it to the variable G.
We can substitute H for G throughout to get F being applied to:

$$\vbox{\settabs\+\cr
\+({\bf LAMBDA} (X) ({\bf FUNCALL} (H H) X))\cr}$$

\noindent But {\bf F} simply makes up a closure that binds the variable G
to the above closure and returns it. So if evaluation ever applies G to
something, it simply applies the above closure to its argument.  G would
be applied to something in the event that a recursive call was
to be made.  H is still bound as before, and the F within the H closure is
bound to the code for {\bf F}. Thus we end up in the same situation as we
were at the outset (that is what the property $\forall f, \hbox{\bf
Y}(f)=f(\hbox{\bf Y}(f))$ means!).

\goodbreak
The machinations of this construction have the effect of providing another
{\it approximation} to the function {\it fact} as it is needed, by carefully
packaging up new closures that will present the code for {\bf F} over and over
as needed.

This is pretty expensive in time and space. It turns out that we can define
\QY/ as follows:

$$\vbox{\settabs\+\cr
\+(&{\bf DEFUN} QY (F)\cr
\+&({\bf LET} &((H (&{\bf LAMBDA} (G)\cr
\+&&&(F (&{\bf QLAMBDA} 'EAGER (X)\cr
\+&&&&({\bf FUNCALL} (G G) X))))))\cr
\+&&\cleartabs(&{\bf QLAMBDA} 'EAGER (X)\cr
\+&&&({\bf CATCH} ({\bf NCONS} ()) ({\bf FUNCALL} (H H) X)))))\cr}$$

\goodbreak
\QY/ is just like \Y/, except that the major closures are eager process
closures, and there is a \catch/ at the toplevel. The eager process closure
at the toplevel will run until it blocks, which means until it needs the
value of X. So (H~H) will begin to be evaluated immediately. Likewise,
subsequent applications of {\bf F} will start off some pre-processing.
Essentially, \QY/ will start spawning off processes that will be
pre-computing the approximations spontaneously. They block when they need
return addresses, but they are ready to go when the main process, the
one calculating factorial, gets around to them.

The \catch/ stops the spawning process when we get to the end.

The performance of \QY/ is between that of \Y/ and {\bf LABELS}, because
\QY/ {\it pipelines} the creation of the approximations.
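For example, assuming F is bound to the abstract version of FACT shown
earlier, the sketch

$$\vbox{\settabs\+\cr
\+({\bf FUNCALL} ({\bf QY} F) 10)\cr}$$

\noindent evaluates to 3628800, with the approximations being spawned
eagerly while the multiplications proceed.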

\sect Performance

The whole point of this language is to provide high performance
for Lisp programs. Because the speed of light and the size of objects
needed to build circuits limit the expected speed of a single-processor
computer, we need multi-processors to achieve higher speeds than these
limits imply.

If the language provided cannot achieve speedups that improve as we
increase the number of processors in a multi-processor configuration, then
there is no point in using that language or in pursuing its
implementation.

Because there are few true multi-processors on the market and because
it is difficult to vary the parameters of the performance
of hardware to study the effects of the variations, we have
chosen to write a rudimentary simulator for this language.
With this simulator we have performed a small number of experiments.
In this section we will briefly describe that simulator and review some
of the results.

\ssect The Simulator

The simulator simulates a multi-processor with a shared memory
and a variable number of processors. So far we have simulated
configurations with 1 to 50 processors. The simulator
is an interpreter for the language described above. Processes
are scheduled on the least busy processor as they are invoked,
but no other load-balancing is performed. The scheduling is round-robin,
but the effect of queues of jobs within a processor is modelled---so
that a process in a wait state does not impact the reported
performance of the multi-processor (much).

Two important aspects of the expected performance are modelled carefully:
the time that it takes to create a process and the time that it takes for
the same memory location to be accessed simultaneously by several
processors.  In creating processes to achieve \qlet/ application,
closures must be created to capture the environment within which the argument
forms are evaluated. This can be a significant overhead in a real
implementation.  Aspects that are not well-modelled are the communications
overhead (sending and receiving messages), simultaneous access to the same
memory region by two processors, and scheduling overhead.  The various
overheads associated with processes can be modelled, to some extent, by
increased process creation times.

\goodbreak
Lisp functions in the underlying Lisp take one unit of time; if one wishes
to be more exact in the simulator, these functions must be re-written in the
interpreted multi-processing Lisp.  Function calls take around 3 units of
time and assignments 1 unit, for example.

The simulator comprises 60,000 characters of Lisp code and runs in PDP-10
MacLisp and in Symbolics 3600 ZetaLisp.

\ssect Fibonacci

The first simulation is one that shows that in a real multi-processor,
with a small number of processors and realistic process creation
times, the runtime tuning provided by the predicates in the \qlet/'s
is important. Figure~1 shows the performance of the Fibonacci function
written as a parallel function on a multi-processor with 5 processors.

\goodbreak
Here is the code:

$$\vbox{\settabs\+\cr
\+&({\bf DEFUN} &FIB (N DEPTH)\cr
\+&&({\bf COND} &(($=$ N 0) 1)\cr
\+&&&(($=$ N 1) 1)\cr
\+&&&(&T\cr
\+&&&&&(&{\bf QLET} &($<$ DEPTH CUTOFF)\cr
\+&&&&&&&(&(&X\cr
\+&&&&&&&&&(FIB ($-$ N 1) ($1+$ DEPTH)))\cr
\+&&&&&&&&(&Y\cr
\+&&&&&&&&&&(FIB ($-$ N 2) ($1+$ DEPTH))))\cr
\+&&&&&&($+$ X Y)))))\cr}$$

\goodbreak
Although this is not the best way to write Fibonacci, it serves to
demonstrate some of the performance aspects of a doubly recursive
function.

The x-axis is the value of CUTOFF, which varies from 0 to 20; the y-axis is the
runtime in simulator units. The curves are the plots of runs
where the process creation time is set to 0, 20, 40, and 100, where 3 such
units is the time for a function call.

As can be seen, for nearly all positive process creation times, the program can
be tuned to the configuration; and for high process creation times, this is
extremely important.
The curves all flatten out because only 177 processes are required by the problem,
and beyond a certain cutoff, which is the depth of the recursion, all of these
processes have been spawned.

\ssect Adding Up Leaves

The performance of a parallel algorithm can depend on the structure
of the data. Not only can a process be serialized by requiring a
single shared resource---such as a data structure---but it can
be serialized by the shape of an unshared data structure. Consider
adding up the leaves of a tree. We assume that the leaves of a
tree---in this case a Lisp binary tree---can either be () or a number.
If a leaf is () it is assumed to have value~0. 

\goodbreak
Here is a simple program to do that:

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &ADD-UP (L)\cr
\+&({\bf COND} &(({\bf NULL} L) 0)\cr
\+&&(({\bf NUMBERP} L) L)\cr
\+&&(T ({\bf QLET} &T (&(N (ADD-UP ({\bf CAR} L)))\cr
\+&&&&(M (ADD-UP ({\bf CDR} L))))\cr
\+&&&($+$ N M)))))\cr

The curves in Figure~2 show the speedup graphs for this program on a full
binary tree and on a CDR tree. A full binary tree is one in which every
node is either a leaf or its left and right subtrees are the same height.
A CDR tree is one in which every node is either a leaf or the left son is
a leaf and the right son is a CDR tree.

These `speedup' graphs have the number of processors on the x-axis and
the ratio of the speed of one processor to the speed of $n$~processors on the
y-axis. Theoretically, with $n$ processors one cannot perform any task
more than $n$ times faster than with one, so the best possible curve
would be a $45↑\circ$ line.

Note that a full binary tree shows good speedup because the two processes
sprouted at each node do the same amount of work, so that the load between
these processes is balanced: If one process were to do the entire task,
it would have to do the work of one of the processes and then the work of
the other, where the amount of work for each is the same. With a CDR tree,
the process that processes the {\bf CAR} immediately finds a leaf and it
terminates. If a single process were to do the entire task, it would only
need to do a little extra work to process the {\bf CAR} over what it did
to process the {\bf CDR}.

From this we see that the structure of the data can serialize a process.

Let's look at another way to write this program, which will demonstrate
the serialization of a shared resource:

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &ADD-UP (L)\cr
\+&(&{\bf LET} ((SUM 0))\cr
\+&&({\bf LET} &((&ADDER\cr
\+&&&&({\bf QLAMBDA} &T (X)\cr
\+&&&&&({\bf SETQ} SUM ($+$ SUM X)))))\cr
\+&&&\cleartabs({\bf QCATCH} &'END\cr
\+&&&&({\bf NO-WAIT} ({\bf SPAWN} (ADD-ALL ADDER L))))\cr
\+&&&SUM)))\cr}$$

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &ADD-ALL (ADDER X)\cr
\+&({\bf COND} &(({\bf NULL} X) T)\cr
\+&&(&({\bf NUMBERP} X)\cr
\+&&&({\bf WAIT} (ADDER X)))\cr
\+&&\cleartabs(T &({\bf SPAWN} (ADD-ALL ADDER ({\bf CAR} X)))\cr
\+&&&(ADD-ALL ADDER ({\bf CDR} X)))))\cr

This program works by creating a process closure (called ADDER in ADD-UP)
that will perform all of the additions. SUM is the variable that
will hold the sum.

ADD-ALL searches the tree for leaves. When it finds one, if the leaf is (), the
process returns and terminates; if the leaf is a number, the number is sent in
a message to ADDER. Then the process returns and terminates.
We {\bf WAIT} for ADDER in ADD-ALL so that SUM cannot be returned from ADD-UP
before all additions have been completed.

If the node is an internal node, a process is spawned which explores the {\bf CAR}
part, while the current process goes on to the {\bf CDR} part.
The performance of this program is not as good as the other because ADDER
serializes the additions, which form a significant proportion of the
total computation. If the search for the leaves were more complex, then this
serialization might not make as much difference. 
The curves in Figure~3 show the speedup for this program on the same full
binary and CDR trees as in Figure~2.

\ssect Data Structures

As we have just seen, the shape of a data structure can influence
the achieved degree of parallelism in an algorithm. Because
most modern Lisps support arrays and some support vectors, we
recommend using arrays and vectors over lists and even trees---these
random-access data structures do not introduce access-time penalties that
could adversely affect parallelism. Assigning a number of processes
to subparts of a vector requires only passing a pointer to the vector
and a pair of indices indicating the range within which each process
is to operate.
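For example, a parallel vector sum could be sketched this way, where VREF
stands for the underlying Lisp's vector accessor, {\bf QUOTIENT} is
integer division, and the range LO (inclusive) to HI (exclusive) is
assumed nonempty:

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} VSUM (V LO HI)\cr
\+\quad({\bf COND} (($=$ ($-$ HI LO) 1) (VREF V LO))\cr
\+\quad\quad(T ({\bf LET} ((MID ({\bf QUOTIENT} ($+$ LO HI) 2)))\cr
\+\quad\quad\quad({\bf QLET} T ((A (VSUM V LO MID))\cr
\+\quad\quad\quad\quad(B (VSUM V MID HI)))\cr
\+\quad\quad\quad\quad($+$ A B))))))\cr}$$

\noindent Each process receives only the vector and its own pair of
indices, so no intermediate list structure is traversed or consed.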

\ssect Traveling Salesman

The performance of an algorithm can depend drastically on the
details of the data and on the distribution of the processes
among the processors. A variant of the traveling salesman problem illustrates
these points.

The variant of the traveling salesman problem is as follows:  Given $n$~cities
represented as nodes in a graph where the arcs between the
nodes are labelled with a cost, find the path with the lowest total cost
that visits each of the cities, starting and ending with a given city. That
is, we want to produce the shortest circuit.

The solution we adopt, for illustrative purposes, is exhaustive search.
We will sprout a process that takes as arguments a control process, a
node, a path cost so far, and a path. This process will check several
things:  First it sees whether the path cost so far is less than the path
cost of the best circuit known. If not, the process dies. Next the process
checks whether the node is the start node. If so, and if the path is a
complete circuit---if it visits every node---then the control process is
sent a progress report which states the path cost and the path.

Failing these, the process spawns a process for each outgoing arc from the
node, including the arc just traversed to get to this node---it is possible that
a node is isolated with only one arc connecting it to the remainder of the
graph.

The control program simply keeps track of the best path and its cost. It maintains
these as two separate global variables, and its purpose is to keep them
from being updated inconsistently.
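
The overall structure might be sketched as follows; this is our
illustration rather than the measured code, and $*$START$*$,
$*$BEST-COST$*$, $*$BEST-PATH$*$, CIRCUITP, ARCS, ARC-NODE, and ARC-COST
are hypothetical names:

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &TSP (CONTROL NODE COST PATH)\cr
\+&({\bf COND} &(({\bf NOT} ($<$ COST $*$BEST-COST$*$)) ())  ; prune and die\cr
\+&&(({\bf AND} ({\bf EQ} NODE $*$START$*$) (CIRCUITP PATH))\cr
\+&&&(CONTROL COST PATH))  ; progress report\cr
\+&&(T ({\bf MAPC} &({\bf LAMBDA} (ARC)\cr
\+&&&&({\bf SPAWN} (TSP &CONTROL (ARC-NODE ARC)\cr
\+&&&&&($+$ COST (ARC-COST ARC))\cr
\+&&&&&({\bf CONS} (ARC-NODE ARC) PATH))))\cr
\+&&&(ARCS NODE)))))\cr}$$

$$\vbox{\settabs\+\cr
\+({\bf SETQ} &CONTROL\cr
\+&({\bf QLAMBDA} &T (COST PATH)\cr
\+&&({\bf COND} &(($<$ COST $*$BEST-COST$*$)\cr
\+&&&({\bf SETQ} $*$BEST-COST$*$ COST)\cr
\+&&&({\bf SETQ} $*$BEST-PATH$*$ PATH)))))\cr}$$

\noindent Because CONTROL is a process closure, progress reports are
processed one at a time, so the two global variables cannot be updated
inconsistently.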

The key is that it may take a long time to find the first
circuit---initially the best path cost is~$\infty$.  Once a circuit is
found many search processes can be eliminated. If there are relatively few
processors and many active processes, then the load on each processor can
interfere with finding the first circuit, and many processes can be
spawned, increasing the load.  We have intentionally kept the problem
general and not subject to any heuristics that would help find a circuit
rapidly, in order to explore the performance of various multi-processors
on the problem.

Figure~4 shows the speedup graph. Two things are of note. One is that 
the curve fluctuates wildly as the number of processors increases. Drawing a
smooth curve through the points results in a pleasing performance improvement,
but the fluctuations are great, and in particular 21~processors does
much better than~31, even though 36~processors is better than either of those
configurations.

In the round-robin scheduler, the positioning of the process---relative to
the other processes---that will find the first complete circuit is
critical, especially when those processes are sprouting other processes at
a high rate.

The graph, by the way, is for a problem with only 5~cities.

The second thing to note is that the graph goes above the curve of the
theoretically best performance---the $45↑\circ$~line. This is because the
1~processor case is sprouting processes, and is thrashing very much
worse than a non-multi-processed version of the algorithm would. In other
words, all such speedup graphs need to be normalized to the highest point
of the curve, not to the 1~processor case.

\ssect Browse

Browse is one of a series of benchmarks for evaluating
the performance of Lisp systems \ref:Gabriel.1982.. It essentially builds a
data base of atoms and their property lists.  Each property list contains
some `information' in the form of a list of tree structures. The list of
atoms is randomized and a sequence of patterns is matched, one at a time,
against each of the tree structures on the property list of each of the
atoms.  In this pattern matcher EQL objects match, `?' variables match
anything, and `$*$' variables match a list of 0 or more things. A variable
of the form `$*$-{\it atom}' must match the same list with each
occurrence.
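
The flavor of the matcher can be conveyed by a minimal sketch of ours,
which handles EQL objects, `?', and `$*$' over list patterns but ignores
the consistency requirement on `$*$-{\it atom}' variables:

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &MATCH (PAT DAT)\cr
\+&({\bf COND} &(({\bf EQ} PAT '?) T)  ; ? matches anything\cr
\+&&(({\bf ATOM} PAT) ({\bf EQL} PAT DAT))\cr
\+&&(({\bf EQ} ({\bf CAR} PAT) '$*$)  ; $*$ matches 0 or more things\cr
\+&&&({\bf OR} &(MATCH ({\bf CDR} PAT) DAT)\cr
\+&&&&({\bf AND} ({\bf CONSP} DAT) (MATCH PAT ({\bf CDR} DAT)))))\cr
\+&&(({\bf ATOM} DAT) ())\cr
\+&&(T ({\bf AND} &(MATCH ({\bf CAR} PAT) ({\bf CAR} DAT))\cr
\+&&&(MATCH ({\bf CDR} PAT) ({\bf CDR} DAT))))))\cr}$$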

The pattern matcher and the control of the exhaustive matching have been
written as parallel code. This sort of `searching' and matching in data bases
of this form is typical of many artificial intelligence programs, especially
expert systems. The performance of multi-processors on this benchmark is
remarkably good.

Two curves are shown in Figure~5: One shows the speedup with the
process creation time set to 10 (where a function call is 3), and the
other is with the process creation time set to 30. In both cases there is
near linear improvement as the number of processors increases.
Approximately 1000 processes are sprouted in this benchmark, although at
most 187~processes are alive at any given time, averaging 117 with a
standard deviation of~.25.

\sect Software Pipelining

Multi-processing programming languages must support the styles of
programming which are natural to the programmer, and they must be able to
express all of the forms of concurrency that exist.
In this section we will introduce a style of programming, called {\it
software pipelining}, which can produce speedup in sequential code under
certain circumstances.  Moreover, sequential code can be easily
transformed into pipelined code.

\ssect Pipelining

Pipelining is a technique used in computer architectures to increase the
execution speed of programs.  The idea is that, while one instruction is
being executed, the next instruction in the instruction stream can be
decoded.\note{We are using a 2-stage pipeline as an example; longer
pipelines are typical in existing hardware.  In the S-1 Mark IIA
\ref:Correll.1979. the pipeline is 11 stages long.}  Thus, the execution of one
instruction can overlap the execution of another. This is possible because
the execution of an instruction can be broken up into independent stages.
Each stage is implemented by a separate piece of hardware that performs
the steps of that stage.

There are several complications with this scheme as far as the hardware is
concerned. First, if the second instruction requires a result from the first
instruction, and if the result is not ready for use from the first instruction
when the second instruction wants to use it, then the pipeline `blocks.' This
delay can cause the throughput of the pipeline to decrease, and sometimes this
decrease can be significant.

Second, if the first instruction is a conditional jump, then the
address to which control will pass might not be known when the second
instruction is being fetched. This will result in a pipeline blockage
also. This type of complication is similar to the first---simply consider
the address from which the next instruction is to be fetched as a
result of the current instruction. Normally this address---the program
counter or PC---is implicitly known to be the next sequential PC after
that of the current instruction.

The effect of pipelining is that programs that are not inherently parallel
can enjoy some degree of parallelism and, hence, can run faster on a
pipelined architecture than on the same architecture without pipelining.
There is a window of instructions such that if one instruction within that
window depends on the result of a second instruction within that window,
then we might expect the pipeline to block.  This window is certainly no
larger than the length of the pipeline, and this window slides along the
stream of executed instructions. For most pipelined computers the window
is smaller than the length of the pipeline.

The hope is that the window is small enough and the dependencies within
that window are rare enough that there will not be a significant slowdown
from the ideal situation, which is a program with no dependencies within
any window.

\ssect Software Pipelining

{\it Software pipelining} is the adaptation of the idea of hardware
pipelining to Qlisp. We will see that programs can often be
pipelined when they cannot be made fully parallel, and we will also
see that dependencies between stages of a software pipeline can be
implemented using a locking mechanism with certain characteristics.

There are two sorts of parallelism in Qlisp: 1)~the true parallelism
that derives from the parallel evaluation of arguments in a \qlet/, and
2)~the unstructured concurrency of process-closure invocation,
particularly when the invocation is in a value-ignoring situation.  In the
latter case, a number of processes are created and messages are passed
among them, causing some pattern of concurrent activity.

Software pipelining is a subspecies of the latter type of parallelism, but
it is a structured subspecies.  Suppose that there is a computation which
is inherently sequential and which must be performed repeatedly.  In fact,
suppose that this computation takes a certain set of arguments and that we
intend to stream sets of arguments to that computation at the fastest
possible rate.  If that computation is expressed as a process closure,
then the second set of arguments cannot be processed until the first set
has been completely processed.  To apply software pipelining to that
computation, we break up that computation into stages such that the amount
of information that needs to be passed from one stage to the next is
manageable; each stage is a process closure.
Because each stage will be smaller than the entire original
computation, the second set of arguments can enter the computation sooner
after the first set than in the original computation.

An example of an inherently sequential computation is one which performs
some actions on a shared resource, such as a global data structure.

\sssect Simple Example

We will present a simple example of a computation along with a pipelined
implementation of it. This example is designed to explain the concept of
software pipelining; it is not a computation that requires
software pipelining---much better speedups can be achieved with true
parallelism, or even with a reformulation of the algorithm, than with
pipelining.

The example is polynomial evaluation: Given a polynomial, $P(x)$, and a
value, $v$,  for $x$, produce $P(v)$. Let

$$P(x)=5x↑4+4x↑3+3x↑2+2x+1,$$

\noindent then using Horner's rule we have

$$P(x)=(((5x+4)x+3)x+2)x+1.$$

We can define four functions---$F↓1$, $F↓2$, $F↓3$, and $F↓4$---such that

$$P(x)=F↓4(x,F↓3(x,F↓2(x,F↓1(x)))),$$

\noindent as follows:

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &$F↓1$ (X)\cr
\+&($+$ ($*$ 5 X) 4))\cr}$$

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &$F↓2$ (X V)\cr
\+&($+$ ($*$ V X) 3))\cr}$$

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &$F↓3$ (X V)\cr
\+&($+$ ($*$ V X) 2))\cr}$$

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &$F↓4$ (X V)\cr
\+&($+$ ($*$ V X) 1))\cr}$$

These functions define stages of the computation of $P$. Each stage is
largely independent of the others aside from the values $x$ and $v$ which
are passed as arguments, and each stage does about the same amount of
computation as the others.  Given these definitions, the pipelined version
of the original polynomial evaluation function can be written:
\vskip .25in

\settabs\+\cr
\+(&{\bf DEFUN} HORNER-STREAM ()\cr
\+&(&{\bf QCATCH} 'HST\cr
\+&&({\bf LABELS} &(&(P1 &(&{\bf QLAMBDA} T (X)\cr
\+&&&&&&(P2 X ($+$ ($*$ 5 X) 4))\cr
\+&&&&&&&T))\cr
\+&&&&\cleartabs(&P2 &(&{\bf QLAMBDA} T (X V)\cr
\+&&&&&&(P3 X ($+$ ($*$ V X) 3))\cr
\+&&&&&&T))\cr
\+&&&&(&P3 &(&{\bf QLAMBDA} T (X V)\cr
\+&&&&&&(P4 X ($+$ ($*$ V X) 2))\cr
\+&&&&&&T))\cr
\+&&&&(&P4 &(&{\bf QLAMBDA} T (X V)\cr
\+&&&&&&($+$ ($*$ V X) 1))))\cr
\+&&&&(P1 $x↓1$)(P1 $x↓2$)(P1 $x↓3$)$\ldots$)))\cr
\vskip .25in

The \qcatch/ will kill all of the processes when HORNER-STREAM is exited,
but only after each of the process closures, P1--P4, has finished its last
computation.  {\bf LABELS} is a construct for defining mutually recursive
functions, and each one of P1--P4 is bound to a process closure.  P1
corresponds to $F↓1$, P2 corresponds to $F↓2$, P3 corresponds to $F↓3$,
and P4 corresponds to $F↓4$. The next stage of each pipe is called in a
value-ignoring position.

The ellipsis [$\ldots$] is a sequence of invocations of P1 for various
arguments. Each of the processes P1--P4 is presumed to be running on a
different processor.

Imagine a stream of arguments to P1, $v↓1\ldots v↓n$. Suppose $v↓1$ is passed to
P1 and then immediately $v↓2$ is passed. P1 cannot process $v↓2$ until $v↓1$
has been processed; as soon as $5x+4$ has been passed on to P2,
P1 can accept $v↓2$, and so on. When P4 is processing the final stage of the
computation of $P(v↓1)$, P3 can be processing the third stage of $P(v↓2)$,
P2 can be processing the second stage of $P(v↓3)$, and P1 can be processing
the first stage of $P(v↓4)$.

The straightforward code for $P$ is:
\vskip .25in

\settabs\+\cr
\+(&{\bf DEFUN} HORNER ()\cr
\+&({\bf LABELS} &((P1 (&{\bf LAMBDA} (X)\cr
\+&&&($+$ 1 ($*$ X ($+$ 2 ($*$ X ($+$ 3 ($*$ X ($+$ 4 ($*$ 5 X)))))))))))\cr
\+&&&&(P1 $x↓1$)(P1 $x↓2$)(P1 $x↓3$)$\ldots$)))\cr
\vskip .25in

Using a simple performance simulator for Qlambda \ref:Gabriel.1984., when
40~requests are streamed through HORNER and HORNER-STREAM, HORNER-STREAM is
approximately 3.8~times faster than HORNER, compared with the theoretical
maximum of 4.0.

Of course, there are better ways to speed up this particular program, but the
technique is the important point.

\sssect Discussion

There are several things to note about this example and the technique.
First, the amount of computation per process closure is quite small, and
the performance of the pipeline will depend on the function-call overhead
in HORNER-STREAM, which is not present in HORNER. Note that because
process-closures are invoked to move control from one stage of the pipe to
the next, function-call overhead is, in reality, message-passing overhead.

Second, the software pipelining style of programming is very much like
that used in continuation-passing style.  With continuation passing,
an additional function---the continuation---is passed as an argument to
each function. When a function computes a value, it applies the
continuation to that value in a tail-recursive position rather than
returning to its caller. With software pipelining, the continuation for
each stage of the pipe is the next stage of the pipe, if there is
one.

The last stage of a pipeline can either invoke some other process, or else
that stage could invoke a continuation that had been explicitly passed to it.
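
The resemblance can be made concrete with a toy fragment of ours, in
which F1-CPS and F2-CPS correspond to $F↓1$ and $F↓2$:

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &F1-CPS (X CONT)\cr
\+&(CONT ($+$ ($*$ 5 X) 4)))  ; apply CONT rather than return\cr}$$

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &F2-CPS (X V CONT)\cr
\+&(CONT ($+$ ($*$ V X) 3)))\cr}$$

\noindent In HORNER-STREAM the continuation of P1 is, in effect, P2; the
difference is that P2 is a process closure, so applying it returns
immediately and the stages can overlap.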

Third, this example does not contain any global or special variables. For this
technique to be useful, it ought to be possible to use it in the presence of
global variables. Of course, heavy use of globals will diminish the performance
advantages, much as instruction dependencies limit the effectiveness of a
pipeline in a computer.

\sssect Global Variable Example

In the simple example of polynomial evaluation, all of the information
that is passed among the stages is passed from one stage to the next
as arguments. Therefore, we can say that all of the information flow
is in the forward direction---each stage can only receive information
from the preceding stages. With global variables it is possible for
this information to be passed in the backward direction.

One use of global variables is for one invocation of a function to
communicate with a later invocation. The first function can store some
value in a variable whose extent is indefinite---a global variable.  A
later invocation of that function (or some other function) can read that
value and act on it.

Global variables are used quite often to record a {\it state} for a function.
In Lisps that do not support closures, a programmer will often construct 
a closure equivalent by packaging the code for a function with an 
explicit environment,
which is simply a set of global variables. With a closure, maintaining an
environment---a state---is simply an example of one function invocation
communicating with a later one. 
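
A familiar illustration (ours) makes the point: a counter whose state
lives in the global variable $*$COUNT$*$, beside one whose state lives in
the closed-over binding of N.

$$\vbox{\settabs\+\cr
\+({\bf SETQ} $*$COUNT$*$ 0)\cr
\+({\bf DEFUN} &BUMP ()  ; state kept in a global variable\cr
\+&({\bf SETQ} $*$COUNT$*$ ($1+$ $*$COUNT$*$)))\cr}$$

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &MAKE-COUNTER ()  ; state kept in a closure\cr
\+&({\bf LET} &((N 0))\cr
\+&&({\bf LAMBDA} () ({\bf SETQ} N ($1+$ N)))))\cr}$$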

In the instruction stream in a computer, an early instruction can write some
value into a memory location or a register, and a later instruction
can read and use that value. When this happens within the pipeline window,
as we discussed earlier, the pipeline may block, and performance
may be lost.

Similarly, we can define a software pipeline in which later stages of
the pipe can communicate with earlier stages---early arguments to the
pipeline can influence the behavior of the pipeline on later arguments.

To handle global variables, we use vectors in place of variables and a
locking mechanism to keep reads and writes of the global variables straight.
We will demonstrate the technique with a variant of the polynomial example:

$$P(x)=5x↑4+4x↑3+3x↑2+Qx+1.$$

\noindent Q is a global variable, and let us assume that after each evaluation
of  $P$, Q is incremented. Thus, the HORNER code should become:
\vskip .25in

\settabs\+\cr
\+(&{\bf DEFUN} HORNER ()\cr
\+&(&{\bf LABELS}\cr
\+&&((P1 (&{\bf LAMBDA} (X)\cr
\+&&&({\bf LET} &((ANS &($+$ 1 ($*$ X ($+$ Q ($*$ X ($+$ 3 ($*$ X ($+$ 4 ($*$ 5 X))))))))))\cr
\+&&&&({\bf SETQ} Q ($1+$ Q))\cr
\+&&&&ANS))))\cr
\+&&$\ldots$))\cr
\vskip .25in

In this new version of HORNER, an early invocation of the code will
communicate with a later invocation by changing the value of Q.

For the sake of the example, we will stipulate that the global variable, Q,
must be updated after the computation of the value of the polynomial has been
completed, though there is no apparent reason for this in the code.
This will allow us to see how a variable might be handled at various stages
in the pipeline.

Because there are four stages in the pipe, we will substitute a 4-long
vector for Q. We will call an index into this vector a `slice of Q.'
Each slice is associated with  one complete flow of control through
the pipeline. That is, each time a value is passed through the pipe in
order to evaluate the polynomial once, a slice of Q is assigned to that
computation.

To allocate the slice, we will create a lock for each slice of Q. In fact,
there will be a 4-long vector that holds the locks. In \ref:Gabriel.1984.
we showed that locks could be implemented with only those primitives
described earlier. However, Qlisp directly supports a more streamlined
locking mechanism.

In Qlisp, locks are first-class citizens and can be passed around
as any other Lisp object can. Locks can be created, gotten, and released.
One interesting aspect of Qlisp locks is that they can be {\it named}.
Most of the time anonymous locks are sufficient, and such locks are granted
in the order in which requests are received. 
A lock is a Qlisp structure with several fields: the owner field, the request-queue
field, and the {\it name} field. Locks are created and pointers to them
are passed exactly as with other Lisp objects. In order to request a lock,
a process must have a pointer to it; if a process requests a lock that
already has an owner, then the requestor is put in the queue for that
lock. If there is no owner, then the name of the lock comes into play.

When a lock is owned by a certain process, that process might wish to
pass the lock to a particular process. To do this, the owner sets the
name of the lock to some particular value.

A lock which has a name can only be granted to a process that asks for
that lock by its name. If several processes request a named lock, then the
first process that asks for that lock by its name gets it.  If a request for a
named lock is placed, and the lock has no name (at that time), then the
request is granted regardless of what name was supplied by the requestor.
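
To make the protocol concrete, here is a minimal sketch of a handoff.
This is our illustration; we assume a lock-creation primitive, written
here as {\bf MAKE-LOCK}, and some unique object ID known to both parties:

$$\vbox{\settabs\+\cr
\+({\bf SETQ} LOCK ({\bf MAKE-LOCK}))  ; anonymous when created\cr
\+({\bf GET-LOCK} LOCK)  ; the current owner acquires it\cr
\+$\ldots$\cr
\+({\bf SET-LOCK-NAME} LOCK ID)  ; destine it for the process knowing ID\cr
\+({\bf RELEASE-LOCK} LOCK)\cr
\+; meanwhile, in the intended successor:\cr
\+({\bf GET-NAMED-LOCK} ID LOCK)  ; granted before any anonymous request\cr}$$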

Named locks will solve a difficult problem in pipelines with global variables.

This is the new code for HORNER-STREAM:
\vskip .25in

\settabs\+999&\cr
\+\ 1\ &(&{\bf DEFUN} HORNER-STREAM ()\cr
\+\ 2\ &&(&{\bf QCATCH} 'HST\cr
\+\ 3\ &&&({\bf LABELS} &(&(P1 &({\bf LET} &(&(N 0)\cr
\+\ 4\ &&&&&&&&(ID ({\bf NCONS} ()))\cr
\+\ 5\ &&&&&&&&(NEXT-ID ({\bf NCONS} ())))\cr
\+\ 6\ &&&&&&&(&{\bf QLAMBDA} T (X)\cr
\+\ 7\ &&&&&&&&({\bf LET} &((LOCK ({\bf GET-LOCK} (LOCK N))))\cr
\+\ 8\ &&&&&&&&&(P2 X ($+$ ($*$ 5 X) 4) LOCK N ID NEXT-ID)\cr
\+\ 9\ &&&&&&&&&({\bf SETQ} ID NEXT-ID)\cr
\+10\ &&&&&&&&&({\bf SETQ} NEXT-ID ({\bf NCONS} ()))\cr
\+11\ &&&&&&&&&({\bf SETQ} N ({\bf REMAINDER} ($1+$ N) 4))\cr
\+12\ &&&&&&&&&&T))\cr
\+13\ &&&&&\cleartabs(&P2 &(&{\bf QLAMBDA} T (X V LOCK N ID NEXT-ID)\cr
\+14\ &&&&&&(P3 X ($+$ ($*$ V X) 3) LOCK N ID NEXT-ID)\cr
\+15\ &&&&&&T))\cr
\+16\ &&&&&(&P3 &(&{\bf QLAMBDA} T (X V LOCK N ID NEXT-ID)\cr
\+17\ &&&&&&(P4 X ($+$ ($*$ V X) (GET-Q N ID)) LOCK N ID NEXT-ID)\cr
\+18\ &&&&&&T))\cr
\+19\ &&&&&(&P4 &(&{\bf QLAMBDA} T (X V LOCK N ID NEXT-ID)\cr
\+20\ &&&&&&&\cleartabs({\bf LET} &((&ANS\cr
\+21\ &&&&&&&&&($+$ ($*$ V X) 1)))\cr
\+22\ &&&&&&&&({\bf SETF} ({\bf AREF} $*$QAR$*$ N) ($1+$ ({\bf AREF} $*$QAR$*$ N)))\cr
\+23\ &&&&&&&&({\bf SET-LOCK-NAME} LOCK NEXT-ID)\cr
\+24\ &&&&&&&&({\bf RELEASE-LOCK} LOCK)\cr
\+25\ &&&&&&&&ANS))))))\cr
\+26\ &&&&$\ldots$)))\cr
\vskip .25in

\noindent where GET-Q is defined as:
\vskip .25in

\settabs\+999&\cr
\+27\ &({\bf DEFUN} &GET-Q (N ID)\cr
\+28\ &&({\bf LET} &((LOCK ({\bf GET-NAMED-LOCK} ID (PREVIOUS-LOCK N))))\cr
\+29\ &&&({\bf LET} &((Q ({\bf AREF} $*$QAR$*$ (PREVIOUS-INDEX N))))\cr
\+30\ &&&&({\bf SETF} ({\bf AREF} $*$QAR$*$ N) Q)\cr
\+31\ &&&&({\bf SET-LOCK-NAME} LOCK ())\cr
\+32\ &&&&({\bf RELEASE-LOCK} LOCK)\cr
\+33\ &&&&Q)))\cr
\vskip .25in

\noindent where $*$QAR$*$ holds the slices of Q.

N is a variable that is local to process P1 and whose values circulate
among 0-1-2-3-0$\ldots$. As each new argument arrives at
P1, the proper slice of Q and the correct lock are selected by this N.
When control is passed on to P2, the second pipe stage, N is 
incremented (mod~4).

ID and NEXT-ID are variables that are local to process P1; 
ID is the name to be used for the current lock, and NEXT-ID is the
name for the next lock.  These two variables are initialized
on lines~4 and~5. {\bf NCONS} takes one argument and returns
a list containing that argument as its sole element. The idiom
$({\bf NCONS}~())$ is frequently used to create a unique pointer for
use as a name.
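
That is, {\bf NCONS} could be defined as

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} NCONS (X) ({\bf CONS} X ()))\cr}$$

\noindent and since each call to {\bf CONS} allocates a fresh cell, the
value of $({\bf NCONS}~())$ is \smcp{EQ} to no previously existing object.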

There are 4~locks used to control access to the four slices of Q.  Each of
the 4~locks---one for each stage---is initially given a null name.

The lock for the $N↑{\hbox{th}}$ slice of Q is grabbed at line~7, so that
no other process has access to that slice of Q until it is released with
\release-lock/.  Each time a lock is grabbed at line~7, it has no name.
The lock, along with the values of N, ID, and NEXT-ID, is passed from stage
to stage. 

A process executing in stage P4 wants the next process in line to get
hold of the value of Q it has just written---no other process should be able to
get to it first. Therefore, we want to name the lock in such a way that
this is inevitable. Consider the environments of the processes in
question. From within stage P4, ID is the current name and NEXT-ID is the
name of the next process in line behind the one currently
 in P4 (lines~9 and~10).
Thus, the value of NEXT-ID in stage P4 is \smcp{EQ} to the value of ID in
stage P3.

At line~23 the name of the lock for the $N↑{\hbox{th}}$ slice of
Q is set to NEXT-ID. Because the process in stage P4 has had control of
this lock since that process started in stage P1, no other process can have
control of it, and now that the lock's name has been set, there is exactly
one process that can get hold of it---the process that asks for that name,
which is the process right behind the one in stage P4.

The process in stage P3 asks for the value of Q, using GET-Q, at line~17.
In GET-Q at line~28, the lock with ID as its name is requested. This will
result in the process in stage P3 getting the lock right after the process
in stage P4 is through with it. In GET-Q, after the value for Q written
into the $N↑{\hbox{th}}$ slice of Q has been copied into the
$(N+1)↑{\hbox{st}}$ (mod~4) slice of Q (line~30), the name of the lock
is set to () (line~31), and the lock is released. Now any other process
can grab this lock.

The worst case is that the reference to Q in P3 will need to wait until
stage P4 of the previous computation has released that slice; 
this will happen quite frequently.  Also, the amount of
computation that goes into each stage is quite low compared with the
amount of computation that goes into maintaining the locks.  The result is
that HORNER-STREAM is slower than HORNER on 40 computed values by a factor
of approximately 1.7.

There are two factors which could improve the performance of pipelines
that communicate among stages with global variables:  1)~The frequency of
interaction between pipe stages using global variables could be decreased,
and 2)~the amount of computation per pipe stage could be increased.

The effect of the frequency of global-variable use on the performance
of a pipeline is obvious. That the amount of computation per stage, compared
with the amount of computation needed to maintain the locks, also determines
performance can be shown by modifying HORNER and HORNER-STREAM
to increase the amount of computation per pipe stage.
\vskip.25in

\settabs\+\cr
\+(&{\bf DEFUN} HORNER-STREAM ()\cr
\+&(&{\bf QCATCH} 'HST\cr
\+&&({\bf LABELS} &(&(P1 &({\bf LET} &(&(N 0)\cr
\+&&&&&&&(ID ({\bf NCONS} ()))\cr
\+&&&&&&&(NEXT-ID ({\bf NCONS} ())))\cr
\+&&&&&&(&{\bf QLAMBDA} T (X)\cr
\+&&&&&&&({\bf LET} &((LOCK ({\bf GET-LOCK} (LOCK N))))\cr
\+&&&&&&&&(P2 X &($+$ ($*$ (DELAY 5) X) (DELAY 4))\cr
\+&&&&&&&&& LOCK N ID NEXT-ID)\cr
\+&&&&&&&&\cleartabs({\bf SETQ} ID NEXT-ID)\cr
\+&&&&&&&&({\bf SETQ} NEXT-ID ({\bf NCONS} ()))\cr
\+&&&&&&&&({\bf SETQ} N ({\bf REMAINDER} ($1+$ N) 4))\cr
\+&&&&&&&&&T))\cr
\+&&&&\cleartabs(&P2 &(&{\bf QLAMBDA} T (X V LOCK N ID NEXT-ID)\cr
\+&&&&&(P3 X ($+$ ($*$ V X) (DELAY 3)) LOCK N ID NEXT-ID)\cr
\+&&&&&T))\cr
\+&&&&(&P3 &(&{\bf QLAMBDA} T (X V LOCK N ID NEXT-ID)\cr
\+&&&&&(P4 X ($+$ ($*$ V X) (DELAY (GET-Q N ID))) LOCK N ID NEXT-ID)\cr
\+&&&&&T))\cr
\+&&&&(&P4 &(&{\bf QLAMBDA} T (X V LOCK N ID NEXT-ID)\cr
\+&&&&&&\cleartabs({\bf LET} &((&ANS\cr
\+&&&&&&&($+$ ($*$ V X) (DELAY 1))))\cr
\+&&&&&&(INCR-GLOBAL N)\cr
\+&&&&&&({\bf SET-LOCK-NAME} LOCK NEXT-ID)\cr
\+&&&&&&({\bf RELEASE-LOCK} LOCK)\cr
\+&&&&&&ANS))))))\cr
\+&&&$\ldots$)))\cr
\vskip .25in

\noindent where DELAY is a macro defined as:
\vskip .25in

\settabs\+\cr
\+({\bf DEFMACRO} &DELAY (FORM)\cr
\+&`({\bf DO} &((I $*$DELAY$*$ ($1-$ I)))\cr
\+&&(({\bf ZEROP} I) ,FORM)))\cr
\vskip .25in

We can similarly modify HORNER. With $*$DELAY$*$ set to 20,
HORNER-STREAM is faster than HORNER over 20 polynomial evaluations
by a factor of approximately 2.2.

\ssect Comparison with Other Techniques

Qlisp is essentially an asynchronous language---we assume several
CPU's attached to a single address space, where each CPU can run several
jobs (or processes) at once in a timeshared mode. We do not assume that
there is any explicit control of the scheduling of processes outside of
those implied by the language (locks and flow of data and control).

Because of this, Qlisp supports a style of parallelism that is quite
different from both systolic arrays and systolic-array-like mechanisms
like the Bagel \ref:Shapiro.1983..  There is no network or grid of
processors, and data/control is not shunted from one processor to the
next. Therefore, the use of locks is necessary to control access to
globals.

Moreover, within each stage of a pipeline it is possible to use the
full power of Qlisp to achieve local speedups---each stage can
exploit a high degree of parallelism within it. This is not
readily performed with any of the systolic-array-like techniques.
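
For example, here is a sketch of ours in which the second stage of a pipe
splits its work into two independent halves with \qlet/; SUBTASK-1,
SUBTASK-2, and COMBINE are hypothetical:

$$\vbox{\settabs\+\cr
\+(P2 &({\bf QLAMBDA} &T (X V)\cr
\+&&({\bf QLET} &T ((A (SUBTASK-1 X V))  ; halves run in parallel\cr
\+&&&&(B (SUBTASK-2 X V)))\cr
\+&&&(P3 X (COMBINE A B)))\cr
\+&&T))\cr}$$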

The examples we have shown have been coded using the underlying
Qlisp mechanisms without using any macros or other structuring.  The
code in each stage of the pipes presented, along with the actions to deal
with global variables within those stages, is so stylized that macros are
easily written to support software pipelining. This would shorten and
simplify the unreadable code we have been using for expository purposes.

The Qlisp programming environment supports a suite of such macros, and
here is how the second version of HORNER-STREAM is actually written:

\vskip .25in

\settabs\+\cr
\+({\bf DEFUN} &HORNER-STREAM ()\cr
\+&({\bf PIPELINE} &FOO ((Q 0))\cr
\+&&(&({\bf STAGE} (X) X ($+$ ($*$ 5 X) 4))\cr
\+&&&({\bf STAGE} (X V) X ($+$ ($*$ V X) 3))\cr
\+&&&({\bf STAGE} (X V) X ($+$ ($*$ V X) ({\bf GLOBAL-REF} Q)))\cr
\+&&&({\bf STAGE} (X V) &($+$ ($*$ V X) 1)\cr
\+&&&&({\bf SETF} &({\bf GLOBAL-REF} Q)\cr
\+&&&&&($1+$ ({\bf GLOBAL-REF} Q)))))\cr
\+&&$\ldots$))\cr
\vskip .25in

In this formulation, each stage simply invokes the next; the form
for a stage in the pipe is:

\settabs\+\cr
\+&&&({\bf STAGE} $<${\it formal arguments}$>$ . $<${\it arguments to next stage}$>$)\cr

There is also a formulation in which pipe stages are explicitly named and
can be invoked by name. This allows one to write a pipeline which is a
directed graph.

Software pipelining is also similar to stream processing, in which one
process supplies a stream of values to another. Software pipelining is
different in that it is useful for introducing concurrency to an
existing serial program by breaking it up into a pipeline of several stages.
Thus, it is not only a technique that is useful for thinking about programs
as processes which produce or consume a sequence of values, but it is also
useful for thinking about increasing the running speed of a program.

This viewpoint allowed us to introduce software pipelining into
programs which use global variables to enable early invocations of a
program to communicate with later invocations.

\sect Geometric Control Structures

In this section we will introduce a style of programming called
{\it geometric control structuring}.

Systolic arrays have been used extensively in numerical computing.
The idea is that a network of computers organized as an array can be programmed
to perform operations---typically on arrays---by streaming data from
several computers in the array to a single computer. That single computer
performs some operation on the data and passes along its value to some other
computer.

The advantage of this style of programming is that geometric intuition about
the structure of a computation can be brought to bear by the programmer to
produce an effective and clear program. If there is a multi-processor which
corresponds to the program structure, then there may be a performance
advantage as well.

\ssect Motivation for Geometric Control Structures

The key observation about systolic arrays is that the control structure
corresponds very closely to the geometry (or topology)  of the problem.

Qlisp was designed as a language suitable for a true shared-memory
multi-processor, but only one of the two major sources of parallelism
in Qlisp is specifically designed for shared-memory systems.
The mechanisms are: parallel evaluation of arguments to a \llet/-like
form and {\it process closures}---a free-form style of processes and
process invocations.

Process closures were used to define the style of programming called
{\it software pipelining}, in which a
serial program can be converted into one with some concurrency by breaking
it into stages. Process closures can be used to define any
hierarchical or heterarchical control structure, and it is the purpose
of this section and the section on software pipelining to define styles of
parallel programming; we hope that these styles will help programmers
find ways to introduce parallelism in their programs.

The choice of a shared-memory multi-processor was made based on the belief that,
in the near term, such multi-processors would provide the best opportunity
to increase the speed of computers on programs similar to those
that exist today.  Qlisp was viewed as a vehicle for moving from serial
Lisp-like languages to parallel Lisp-like languages on such machines.

But process closures are unstructured, and there is no reason that the
multi-processor needs to be shared-memory---only shared address space
is necessary. Process closures can be run on any topology of processors,
and if the process structure that the programmer chooses matches the
geometric or topological structure of the multi-processor, then so much the
better for the performance of the program.

For argument's sake, let us assume that the geometry of the non-shared-memory
multi-processor is an array or grid of processors, each with some amount of
memory. Memory requests from one processor to the memory of a second processor
are routed through a network of some sort.

\ssect Data-Structure-resident Closures

The key idea is to allocate processes within a data structure. If this
data structure is global, then each process within the data structure is able
to access other processes---that is, any process can invoke any other 
process that it is able to access through the data structure.

For example, suppose we have a list, and each element in the list is 
a process. Suppose that the only handle we have on the list is a
pointer to the process stored in the \smcp{CAR} of the list. 

Another example might be to define a 2-dimensional array, which could
correspond to some physical aspect of the problem the programmer wishes
to solve. If the solution to the problem requires that each element
in the array have an associated process, which performs some computation,
then the programmer can store a process closure in each element in the
array. These process closures can incorporate communications capabilities
to other elements in the array, perhaps limited to nearby neighbors.
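
Here is a sketch of ours of such an array of process closures; boundary
checks are elided, and LOCAL-STEP is a hypothetical per-element
computation. Each closure performs its local step and then passes the
result to its eastern neighbor:

$$\vbox{\settabs\+\cr
\+({\bf DEFUN} &MAKE-GRID (N M)\cr
\+&({\bf LET} &((GRID ({\bf MAKE-ARRAY} ({\bf LIST} N M))))\cr
\+&&({\bf DO} ((I 0 ($1+$ I))) (($=$ I N) GRID)\cr
\+&&&({\bf DO} ((J 0 ($1+$ J))) (($=$ J M))\cr
\+&&&&({\bf LET} &((I I) (J J))  ; fresh bindings for each closure\cr
\+&&&&&({\bf SETF} &({\bf AREF} GRID I J)\cr
\+&&&&&&({\bf QLAMBDA} &T (MSG)  ; send results eastward\cr
\+&&&&&&&({\bf FUNCALL} &({\bf AREF} GRID I ($1+$ J))\cr
\+&&&&&&&&(LOCAL-STEP MSG)))))))))\cr}$$

\noindent Once the grid is built, invoking any element sets its piece of
the computation in motion; the mapping of grid elements onto physical
processors can be decided separately.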

Once the problem is solved in its own terms, attention can be turned towards
laying the data structure of process closures down on the physical
structure of the multi-processor. Perhaps the multi-processor is itself
a rectangular array and the mapping is simple. Perhaps not. The key is 
to solve the problem without much concern for the geometry of the
underlying hardware, leaving the matching of the software architecture
to the hardware architecture until later.

Software pipelining can be seen as a simple example of this idea, with
the underlying data structure being a simple list. Systolic arrays can
be viewed this way with some sort of grid as the underlying data
structure.

\sect Conclusions

We have presented a new language for multi-processing. A variant of Lisp,
this language features a unique and powerful diction for parallel programs.
Parallel constructs are expressed elegantly, and the language extensions
are entirely within the spirit of Lisp.

The problem of runtime tuning of a program is addressed and adequately solved.
The performance of programs written in this language as a function of the
size of the multi-processor is explored.

Multi-processors that support shared memory among processors are important,
and some or all of the nodes in a distributed system should themselves be
multi-processors of this style.  To achieve maximum performance we will
need to pull every trick in the book, from coarse-grained down to
fine-grained parallelism. This language is a step in the direction of
achieving that goal by allowing programmers to easily express parallel
algorithms.

\sect Acknowledgments

John McCarthy invented queue-based multi-processing as it is
discussed in this paper. In particular,  \qlet/ was his invention.
From that starting point, and with valuable discussions with both
John McCarthy and Jeff Ullman, I was able to flesh out Qlisp as
it appears here. Of course Qlisp is not intended to be the final
design for a parallel Lisp; for example, Qlisp does not address
completely the question of distributed computing, which may turn out
to be a significant aspect of future parallel computers.

\references

\makeref:Correll.1979.{Correll, Steven.
\article{S-1 Uniprocessor Architecture (SMA-4)}
in \book{The S-1 Project 1979 Annual Report}, Chapter 4.  Lawrence
    Livermore National Laboratory, Livermore, California, 1979.}

\makeref:Gabriel.1982.{Gabriel, R. P., Masinter, L. M.
\article{Performance of Lisp Systems},
Proceedings of the 1982 ACM Symposium on Lisp and Functional
Programming, August 1982.}

\makeref:Gabriel.1984.{Gabriel, R. P., McCarthy, J.
\article{Queue-based Multiprocessor Lisp},
Proceedings of the 1984 ACM Symposium on Lisp and Functional
Programming, August 1984.}

\makeref:Halstead.1984.{Halstead, Robert,
\article{MultiLisp},
Proceedings of the 1984 ACM Symposium on Lisp and Functional
Programming, August 1984.}

\makeref:Smith.1978.{Smith, Burton J.,
\article{A Pipelined, Shared Resource MIMD Computer} in
\book{Proceedings of the International Conference on Parallel Processors},
1978.}

\makeref:Steele.1978.{Steele, Guy Lewis Jr., and Sussman, Gerald Jay.
\article{The Revised Report on SCHEME: A Dialect of LISP},
      AI Memo 452, Massachusetts Institute of Technology Artificial
         Intelligence Laboratory, Cambridge, Massachusetts, January,
         1978.}

\makeref:Steele.1984.{Steele, Guy Lewis Jr., et al.
\book{Common Lisp Reference Manual}, Digital Press, 1984.}

\makeref:Sussman.1975.{Sussman, Gerald Jay, and Steele, Guy Lewis Jr.
\article{SCHEME: An Interpreter for Extended Lambda Calculus},
      Technical Report 349, Massachusetts Institute of Technology
         Artificial Intelligence Laboratory, Cambridge, Massachusetts,
         December, 1975.}

\bye